
1 Introduction

Distributed control systems are continuously gaining importance, as more and more devices and machines are equipped with embedded systems that control their operation. Computers in these control systems are increasingly powerful and networked, providing intelligence and interoperability. Examples of such systems range from large mobile machines to groups of robots and intelligent sensor networks. These cyber-physical systems (CPSs) interact with the environment and physical processes, influencing many parts of our lives either directly or indirectly. Therefore they need to be dependable, which can be measured with the attributes of availability, reliability, safety, integrity and maintainability [1]. However, with the increased functionality and intelligence, the complexity of these systems also increases, meaning that the development process becomes more demanding and dependability becomes more costly to achieve and verify. Another significant feature of CPSs is that they often have strict timing constraints, which may put limitations on the architecture.

Many critical systems that have failed catastrophically are well known – examples such as the Therac-25 radiation therapy machine and the explosion of the Ariane 5 rocket are infamous – whereas highly reliable systems receive little recognition, even though their study might give valuable ideas for the design and architecture of new software. One example of such a system can be found in telephony applications, namely the Ericsson AXD301 Asynchronous Transfer Mode (ATM) switch that achieved nine nines (99.9999999%) service availability, running software written in Erlang [2]. Erlang’s highly decoupled actor model and supervisor-based fault handling have especially inspired the Let it crash and Service manager patterns found in this paper.

This paper presents three software patterns that can be used to improve control system dependability – Data-centric architecture, Service manager and Let it crash – and shows how they fit in the existing literature by addressing the specific needs of CPSs. The approach promoted by these three patterns is based on implementing a decoupled architectural design with supporting fault mitigation and handling. The decoupled architecture can also be used to gradually introduce additional fault tolerance solutions such as checkpointing and rejuvenation to the system, until a sufficient level of reliability has been achieved [3]. Our patterns were originally encountered in the research of remote handling control systems for robotic manipulators, but all patterns have examples of other known uses as well. These examples are presented in the corresponding sections of the patterns.

One reason why development of CPSs is difficult is that the systems typically consist of dynamic service chains that operate on a wide range of platforms, which complicates management of end-to-end deadlines. Moreover, modern middleware provides capabilities to flexibly change service deployment on these subsystems, but some configurations may be inefficient or even unusable if communication links become overloaded. While adaptability has benefits, these uncertainties nevertheless complicate assurance of the reliability and predictability of the system. Therefore, CPSs benefit from a design that makes the overall system more robust, whereas more traditional fault tolerance solutions, such as hardware redundancy, are arguably better suited for static safety-critical subsystems.

A data-centric approach is one way to increase decoupling between communicating units. However, data-centric design as a central communication paradigm, as well as the concept of CPS, is still fairly novel in the domain of distributed control systems. Although control systems are by nature data-centric (read sensor data and desired output, send actuator commands, etc.), this communication has usually been point-to-point, from point A to point B. The patterns in this paper capture some of the ways in which reliability-related challenges faced in developing more intelligent and adaptable distributed control systems have been solved. The next section shows how our patterns fill gaps in the existing pattern literature by addressing needs specific to CPSs.

2 Context of the Patterns

Fault tolerance cannot be implemented without redundancy of some kind. To have fault tolerance for e.g. computer failures, we would need at least two computers – if one fails, the other can detect the error and try to correct it. Software faults, on the other hand, are typically development faults, which are harder to detect and correct than hardware faults. To have good coverage for software faults, diverse redundancy (e.g. N-version programming) is needed, but it has been criticized for being susceptible to common mode failures [4]. Moreover, development costs for design diversity are often seen as prohibitive.

The patterns in this paper present an alternative approach to fault tolerance, based on dividing the system into highly decoupled modules and implementing a lightweight form of fault tolerance. We present an architectural pattern called Data-centric architecture as one way to achieve a high level of decoupling. One of the key points of decoupling is that it should by itself improve reliability by limiting fault propagation and improving the modularity and understandability of the system. In a way, the modular approach is similar to the compartmentalization of ships – without compartments, every leak can sink the ship. An example of a software system that uses modularity to successfully implement fault isolation and resilience is the MINIX 3 operating system released in 2005 [5]. The driver management of MINIX 3 is presented as one of the known uses of Service manager.

A modular and decoupled architecture can also be used to implement other reliability-improving patterns such as Service manager and Let it crash documented in this paper, or other well-known patterns like Leaky bucket counter [6], Watchdog [6, 7], etc. Short descriptions of the patterns presented in this paper are listed in Table 1. A list of all referenced patterns with descriptions can be found in the appendix.

Table 1. Pattern descriptions

Data-centric architecture provides the decoupled architectural model needed to use Let it crash for fault handling. The Service manager pattern provides a way to attempt recovery after failures, in addition to providing error detection and monitoring. The idea of crashing a process, as suggested by Let it crash, may sound like a risky action to take. However, the idea is to offer recovery from transient physical and interaction faults (sometimes called Heisenbugs), to keep the system as a whole functioning even if some internal process crashes, and to make it possible to hot-swap code and bug fixes. The downside of this approach is of course that it is not suited for fail-operate systems such as flight controllers that must be operational all the time – such systems would be the right domain for design diversity.

In order to show how these patterns fit the existing literature, we have built a pattern language for fault tolerance in CPSs that references related patterns and pattern languages, shown in Fig. 1. The entry point to the language is the need to introduce fault tolerance to the system in order to improve its dependability. The three main starting points are Minimize human intervention [6], Redundancy [6] and Units of mitigation [6], but the Redundancy branch has not been explored in depth since it presents a somewhat different approach from the three patterns found in this paper. Recovery types have also been condensed into a single concept. Some of the connections presented in the original sources have been reorganized in order to better fit this context, and the figure shows only one of the possible combinations of the patterns. Connections to other patterns and pattern languages can be checked from the references in Table 2 found in the appendix.

Fig. 1. Pattern language for fault tolerance in cyber-physical systems

The pattern language shows how the patterns presented in this paper build on top of existing patterns and support implementing fault recovery and Safe state [7] in CPSs. The gaps identified in the pattern language are related to CPSs being networked systems with real-time requirements and safety concerns. Fault handling needs extra attention since a control system cannot try complex fault recovery routines that could have unforeseen consequences. Instead, a better approach is to Quarantine [6] the faults locally and stop their propagation, even if that means losing some functionality either temporarily or permanently.

There are several existing patterns that have a similar purpose as Service manager, such as Fault observer [6], Replica manager [15], Service configurator [16] and System monitor [6]. However, CPSs benefit from a more active management component that can react to failures within system specifications – because they typically have timing-critical control loops and state machines – in order to mitigate faults and stop their propagation in the system.

Finally, to implement the fault handling, units need a loosely coupled architecture that is robust to failures and supports fault detection. The patterns in the pattern language work together by building on top of features provided by other patterns, as shown in Fig. 1, but all of the patterns can also be used in other contexts besides distributed control systems. Other well-known fault tolerance patterns also work well in combination with the presented patterns. Besides the patterns presented here, other typical reliability-related concerns in CPSs include fault detection, fault reporting, and the sending and acknowledgement of commands; these have been left out of this paper.

3 Patterns

3.1 Data-Centric Architecture


Intent.

Implement an architecture based on autonomous modules (e.g. services, processes or applications) that communicate by sharing properly modeled data.

Context.

You are developing a distributed control system that consists of several subsystems and needs to interact with other heterogeneous systems such as mobile machines or plant systems. The system has CPU and memory resources available to run an operating system – rather than being based on a basic time-triggered scheduler used in resource-constrained embedded systems. Failures in control functions (e.g. boom or manipulator control) may cause damage to the environment and equipment, meaning that some subsystems may be categorized as safety or mission-critical.

Problem.

How to implement a reliable and scalable distributed control system?

Forces

  • Throughput: Some time-critical data, such as sensor measurements, may be updated with a short period, producing large amounts of communication.

  • Scalability: New nodes and subsystems can join the system at any time; assumptions about interfaces between modules should be minimized.

  • Changeability: System configuration and functionality might change. Changing interfaces in a tightly coupled system requires code changes at both ends (and at all clients), so assumptions about expected behavior should be minimized. Point-to-point protocol based client-server architectures (e.g. sockets or remote method invocation) are not ideal because of the complexity and coupling they introduce.

  • Maintainability and long expected life-cycle: The control system has a long expected lifetime and needs to be maintainable and extensible in the future – if subsystems are added or substituted, changes to existing modules need to be minimized. The system should be easy to understand and modify without breaking it.

  • Maintainability: Implementing custom communication channels and protocols should be avoided.

  • Reusability: Same modules could be used in other control system implementations.

  • Interoperability: Distributed control systems consist of and/or need to communicate with heterogeneous platforms.

  • Testability: Tightly coupled modules are difficult to test because they are more dependent on other modules.

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures.

  • Reliability: A single fault in the control system software should not endanger the functionality of the whole system (i.e. no single points of failure).

  • Reliability: Faults should be detected and their propagation prevented.

  • Real-time performance: Control system interacts with the real world and needs to react in a deterministic manner.

  • Safety: The system needs to detect if a module has crashed or is down (not releasing new information) so that it can enter a Safe state in a controlled fashion. Safety-critical and non-safety-critical subsystems cannot be tightly coupled, since errors may propagate.

  • Quality of service: Different subsystems may have different requirements for quality of service (QoS) policies. There is an impedance mismatch between e.g. real-time control systems that operate on a timescale of milliseconds and enterprise/high-level systems that are several orders of magnitude slower.

Solution.

Build the system from autonomous modules that communicate by sharing data that is based on a well-designed and consistent data model.

Implement communication between modules as sharing of data, instead of sending point-to-point messages or request-reply service calls. The data-centric approach is based on minimizing dependencies between modules by removing direct inter-module references and hiding module-specific behavior. This can be achieved by delegating data handling to a middleware solution that supports publishing data to topics in a distributed data space, and by making applications tolerate the unavailability of their dependencies. Asynchronous messaging is a well-known way to reduce the coupling of systems, but the data-centric approach goes further by removing the concept of a recipient from the publisher.

Modules should be built to be autonomous and should not expect that other services are always started in a specific order and available. Service/module composition may change during runtime; there are patterns for managing the configurations (e.g. Service configurator). Developers should avoid assumptions about the state of dependencies, i.e. other services. Dependencies may not always be available, and this must be taken into account in the application code so that the service reacts accordingly if a dependency is down because it is in the process of starting, has failed, has been manually shut down, etc.

Management of the global data space is externalized to the middleware, which implements a topic-based Publish/subscribe model. The middleware disseminates data to all participating nodes and acts as a single source of up-to-date system-wide state information, instead of applications managing state separately.

Modules do not need to know the recipients of the data when publishing it, which reduces coupling. Instead of sending data directly to a recipient, it is published to a topic. The data can be e.g. sensor measurements, events or commands, but it must follow a shared information model, which is represented as topics in the actual system implementation. Publishers register as data writers to a topic, and interested subscribers can join the topic as data readers. A single topic can have multiple instances, which are identified by a key value, and can have multiple readers and writers, as shown in Fig. 2.

Fig. 2. Data is published to topics that can have multiple data writers and readers. Topic A has two instances, identified by the id number key value.
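The following minimal, in-process sketch illustrates this arrangement of keyed topics, data writers and data readers; the Topic and WheelSpeed names are hypothetical stand-ins, not a real middleware API (an actual system would use e.g. DDS):

# Illustrative sketch of keyed, topic-based publish/subscribe; not a real middleware API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WheelSpeed:              # hypothetical topic data type
    wheel_id: int              # key field: identifies the topic instance
    rpm: float


class Topic:
    """Keeps the latest sample per instance and notifies all data readers."""

    def __init__(self, name: str, key: str) -> None:
        self.name = name
        self.key = key
        self.instances: Dict[object, object] = {}        # key value -> latest sample
        self.readers: List[Callable[[object], None]] = []

    def write(self, sample: object) -> None:
        """Called by data writers; the publisher never names a recipient."""
        self.instances[getattr(sample, self.key)] = sample
        for reader in self.readers:
            reader(sample)

    def subscribe(self, reader: Callable[[object], None]) -> None:
        """Late joiners immediately receive the current state of every instance."""
        self.readers.append(reader)
        for sample in self.instances.values():
            reader(sample)


wheel_speed = Topic("WheelSpeed", key="wheel_id")
wheel_speed.write(WheelSpeed(wheel_id=1, rpm=120.0))     # instance 1
wheel_speed.subscribe(lambda s: print(f"wheel {s.wheel_id}: {s.rpm} rpm"))
wheel_speed.write(WheelSpeed(wheel_id=2, rpm=118.5))     # instance 2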

Since the middleware decouples the modules, a publisher might assume that a subscriber is listening when it is not. If a publisher needs to know that data has been received, it should monitor the status of the subscriber (published to another topic). This might be the case, for example, with command sequences where a command must be completed before the next one is sent.

Instead of designing callable methods for components, you must design how to represent the state of the system and the external or internal events that can affect it. This is captured in a common data model, which contains the essential elements of the physical system and application logic. Conceptually the data model is similar to a class diagram in object-oriented programming, since it consists of identifying entity types, which have data attributes assigned to them, and associations. The difference is that the data model focuses on data instead of behavior. The data model ensures that communication between modules is unambiguous and interoperable. Appropriate QoS attributes can also be attached to the data model.
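As an illustration, a fragment of such a data model might be sketched as follows; the entity types, attributes and QoS values are hypothetical and only show the idea of keyed entity types with associations and attached QoS policies:

# Sketch of a shared data model module; all names and values are hypothetical.
from dataclasses import dataclass
from enum import Enum


class Reliability(Enum):
    BEST_EFFORT = 1    # newest value wins, losses tolerated (e.g. sensor streams)
    RELIABLE = 2       # delivery retried until acknowledged (e.g. commands, events)


@dataclass(frozen=True)
class Qos:
    reliability: Reliability
    deadline_ms: int            # maximum allowed period between samples


@dataclass
class JointState:               # entity type: state of one manipulator joint
    joint_id: int               # key attribute, identifies the topic instance
    angle_rad: float
    velocity_rad_s: float


@dataclass
class MoveCommand:              # entity type: an event that affects the system state
    command_id: int             # key attribute
    joint_id: int               # association to a JointState instance
    target_angle_rad: float


# QoS policies attached to the data model rather than scattered in application code.
TOPIC_QOS = {
    "JointState": Qos(Reliability.BEST_EFFORT, deadline_ms=10),
    "MoveCommand": Qos(Reliability.RELIABLE, deadline_ms=500),
}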

Communication and application logic are separated, since network communications are delegated to a “data bus” formed by the publish/subscribe middleware (Fig. 3), so that the application logic can focus on the core functionality. The middleware takes care of keeping the data up to date, automatically updating new nodes that join. If the middleware uses a central server as a message Broker [8], it becomes a single point of failure and possibly a bottleneck. Therefore, choose a decentralized middleware solution, if possible, to avoid this problem.

Fig. 3. Middleware implementation as a virtual data bus that has no central components or brokers. Services and subsystems can join topics as publishers and/or subscribers.

The granularity of modules and interactions is an important design decision that affects failure consequences, performance and reusability of modules. Fine-grained autonomous modules (a large number of smaller modules) are easier to reuse and make it easier to isolate faults, but limiting the number of modules and interactions helps to avoid potential performance issues. Modules communicating only locally can be more fine-grained than ones communicating remotely, although the data model should not include location dependencies. Fine-grained interactions give more flexibility, as it is possible to treat data items separately. Coarse-grained interactions are usually preferred between remote modules in order to avoid overhead, but data that is updated rapidly should be separated from data with slow update rates in order to avoid unnecessary use of bandwidth. Further control over system granularity can be achieved by dividing the system into domains.

Compared to message-centric publish/subscribe, one of the differences in the data-centric model is that the middleware understands the data samples published to topics. In the message-centric model, the middleware does not know or care about message contents, and communication is point-to-point by nature, which introduces coupling between modules, although some message-centric middleware also support publishing messages to topics. Data-centric communication is based on a data model that expresses the state of the system. Since data is interpreted through the model, it is platform-independent, and the middleware can prioritize, filter and manage the data based on its contents and QoS policies, replacing part of the application logic. Although developing a data model adds to upfront planning efforts, systems with long life-cycles benefit in terms of maintainability and evolvability.

Consequences

+: Publishers do not need to know about subscribers.

+: Interoperability between heterogeneous platforms, since data is interpreted through the data model.

+: Decoupled design provides error confinement and other benefits such as improved maintainability.

+: Modules can be changed dynamically because late joiners receive new data automatically; the ability to hot-swap code can be easily implemented.

+: An application or subsystem can be shut down without impacting the overall operation of the system.

+: The network transport layer is abstracted as communications are externalized to the middleware, which reduces communication-related code and simplifies implementation.

+: Gives developers control of data delivery with QoS management; QoS can be used e.g. to guarantee reliable delivery (eventually) or to keep the available data up to date with best effort. The former is useful for sending status changes or commands, whereas the latter could be used for sensor measurements for which guaranteeing delivery of outdated samples makes no sense.

+: Reusability is improved since modules are not using shared memory, have their own namespaces, etc.

+: Publish/subscribe based middleware scales effectively since recipients for data are not explicitly defined.

+: Performance gains can be achieved on multi-core machines since modules can be easily parallelized and they communicate asynchronously.

±: Needs good and consistent data models that must be managed and maintained, but a well-thought-out data model improves maintainability and makes reuse of the code easier.

−: A publisher might assume that a subscriber is listening when it is not.

−: Sending commands is not as straightforward as in client-server architectures, since commands need to be parsed from the data. However, interactions can be modeled as operation codes sent between two modules.

−: Parsing of data complicates debugging because it adds another potential source of faults. If data is parsed incorrectly, the origin of the fault may not be self-evident.

−: Extra code is needed compared to more monolithic applications, since modules cannot presume that all dependencies are started in a specific order and available all the time.

−: Serialization and deserialization of the data structures for transmission may add overhead.

−: Faults in the middleware itself complicate testing and are hard to detect.

−: Middleware solutions add some overhead to message size and use system resources.

−: Possible vendor lock-in to the middleware provider.

Known uses.

Data Distribution Service for Real-Time Systems (DDS) is a decentralized and data-centric middleware based on the publish/subscribe model. DDS is aimed at mission-critical and embedded systems that have strict performance and reliability requirements; therefore, its implementations have typically been optimized and tested to suit the needs of these systems. DDS is used as the information backbone in the Thales TACTICOS naval combat management system, which integrates various subsystems such as weapons, sensors, countermeasures, communication and navigation into a “system of systems”. Applications are distributed dynamically over a pool of computers in order to provide combat survivability and avoid single points of failure. The system configuration can be adapted for use in various mission configurations, on-board and simulator training, and different ship types.

Related Patterns.

Bus abstraction [7] and Publisher-subscriber.

Mediator [9] increases decoupling in a similar fashion, but is designed to decrease connections between objects locally.

Decoupled modules in Data-centric architecture act as Units of mitigation, parts that contain errors and error recovery.

3.2 Service Manager


Also Known as.

Supervisor.

Intent.

Service manager starts, stops, and monitors processes locally and takes care of resource allocation for systems that need high availability and real-time performance.

Context.

You are developing a system with a highly decoupled architecture (e.g. using Data-centric architecture) that consists of a large number of processes or tasks (services). These processes have dependencies and therefore need to be started in a specific order. The process composition may change dynamically during runtime because your system has intelligent functionality, needs to adapt to new situations, or different functionalities need to be tested without stopping and restarting the whole system.

You know rough upper-limit estimates for how much system resources such as memory and CPU time the processes will use.

The system has a long expected life-cycle. It is likely to be deployed in a remote location, for example a forest or a control cubicle, making direct physical interaction with the system a bothersome task.

If you have a real-time operating system and a task gets stuck in a while loop or some other control structure, it freezes the whole system, as other lower-priority processes (including input devices and network connections) cannot get CPU time. In this case, the only option is usually to restart the whole computer manually.

Problem.

How to ensure that all dynamic modules in your control system are running correctly and you have enough system resources to achieve deterministic real-time performance?

Forces

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures, in order to be able to use the parts of the system that are not connected to the failed subsystem. The system must detect faults and try to mitigate them automatically. If a failure needs immediate reaction from a human operator, the system will not scale cost-efficiently and reliably.

  • Data logging/testability: If a process fails, the failure should be detected and logged.

  • Real-time performance: The control system needs to respond in a deterministic and predictable manner. Predictability includes system behavior when a fault is triggered.

  • System resources: Control systems are typically deployed on embedded devices that have limited memory and CPU resources available. These resources may need to be monitored in order to guarantee the real-time performance of the system.

Solution.

Implement a service manager that can monitor, start and stop local modules.

Create a local parent process (the service manager) that is responsible for starting, stopping and monitoring its child processes. The basic idea of the service manager is to keep its child processes alive by restarting them when necessary. The service manager is located on the same computer as the child processes in order to keep the implementation simple; therefore, all computers in the system need their own, independently functioning, service managers. The service manager is given the highest process priority in the system or is put in the kernel so that a faulty real-time process cannot prevent it from functioning by consuming all available CPU time.

Start the child processes based on a fixed order or a dependency table read from a configuration file, similar to Start-up monitor [7], and/or implement a user interface that can be used to start and stop processes.

Use the service manager to allocate resources such as CPU time and memory for the child processes and to monitor their use. Expected maximum resource consumption can be specified in the same configuration file that is used for starting services. New processes are not started if there are not enough resources available. If a process consumes more resources than expected, it can be restarted, triggering error handling according to the Let it crash pattern. Resource use can be followed e.g. with the proc filesystem or the getrusage call in Unix-like systems.
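A rough sketch of such a service manager is given below; the service list, resource limits and monitoring period are hypothetical, the sleep commands stand in for real service binaries, and a production implementation would add logging, process priorities and proper error reporting:

# Illustrative local service manager sketch (Linux, Python); all values are hypothetical.
import subprocess
import time

# Start order encodes dependencies; limits are rough upper-bound estimates.
SERVICES = [
    {"name": "io_server",  "cmd": ["/bin/sleep", "3600"], "max_rss_kib": 50_000},
    {"name": "controller", "cmd": ["/bin/sleep", "3600"], "max_rss_kib": 200_000},
]


def rss_kib(pid: int) -> int:
    """Resident memory of a process in KiB, read from the proc filesystem."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass                                   # process exited between checks
    return 0


def start(service: dict) -> subprocess.Popen:
    print(f"starting {service['name']}")
    return subprocess.Popen(service["cmd"])


processes = {s["name"]: start(s) for s in SERVICES}    # fixed start order

while True:
    for service in SERVICES:
        proc = processes[service["name"]]
        crashed = proc.poll() is not None              # child has terminated
        over_limit = not crashed and rss_kib(proc.pid) > service["max_rss_kib"]
        if over_limit:
            proc.terminate()                           # trigger Let it crash handling
            proc.wait()
        if crashed or over_limit:
            print(f"restarting {service['name']}")
            processes[service["name"]] = start(service)
    time.sleep(1.0)                                    # monitoring period

Here a child is restarted both when it terminates unexpectedly and when its resident memory exceeds the configured limit, which hands the fault over to the Let it crash handling described later in this paper.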

Since one of the key functionalities of the service manager is to monitor processes for failures, error detection can be based on additional or alternative techniques besides resource monitoring, e.g. operating system features, Heartbeat [6, 7] or Watchdog.

If fault recovery fails, the service manager should mitigate the fault by Quarantining the faulty module. If the fault is persistent, a Leaky bucket counter can be used to limit the number of restarts.
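A possible sketch of such a counter is shown below; the limit and leak period are hypothetical values that would be tuned per system:

# Sketch of a Leaky bucket counter used to decide between restart and quarantine.
import time


class LeakyBucketCounter:
    """Allows occasional restarts but quarantines a module that keeps failing."""

    def __init__(self, limit: int = 5, leak_period_s: float = 60.0) -> None:
        self.limit = limit
        self.leak_period_s = leak_period_s
        self.level = 0
        self.last_leak = time.monotonic()

    def record_failure(self) -> bool:
        """Returns True if restarting is still allowed, False if the module
        should be quarantined instead."""
        now = time.monotonic()
        leaked = int((now - self.last_leak) / self.leak_period_s)
        if leaked:                                   # forget old failures over time
            self.level = max(0, self.level - leaked)
            self.last_leak = now
        self.level += 1
        return self.level <= self.limit

The service manager would call record_failure() whenever a service crashes; once the bucket fills up within the leak period, the module is Quarantined instead of restarted.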

If the service manager is deployed on a system that uses Data-centric architecture, service startup interfaces can be implemented through the middleware. Since the middleware abstracts the location of the data, it can be used to remotely start dependencies. For example, service manager SM_A must start a service called S1. However, S1 has a dependency called S2 which cannot be found locally, so the service manager publishes a start request for S2. A second service manager SM_B on another computer notices the request, starts S2 and publishes information about the successful startup. SM_A receives the information that S2 is available and starts S1.
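This exchange could be sketched as follows; the in-process publish/subscribe helpers, topic names and service names are hypothetical stand-ins for the real middleware:

# Self-contained sketch of the start-request exchange between two service managers.
from collections import defaultdict

subscribers = defaultdict(list)              # topic name -> list of callbacks


def publish(topic: str, message: dict) -> None:
    for callback in subscribers[topic]:
        callback(message)


def subscribe(topic: str, callback) -> None:
    subscribers[topic].append(callback)


# SM_B hosts S2 and reacts to start requests published by other managers.
def sm_b_on_request(msg: dict) -> None:
    if msg["service"] == "S2":
        print("SM_B: starting S2")
        publish("service_status", {"service": "S2", "state": "running"})


# SM_A hosts S1, which depends on S2; it waits for S2 to be reported running.
def sm_a_on_status(msg: dict) -> None:
    if msg["service"] == "S2" and msg["state"] == "running":
        print("SM_A: dependency available, starting S1")


subscribe("service_requests", sm_b_on_request)
subscribe("service_status", sm_a_on_status)
publish("service_requests", {"service": "S2"})    # SM_A requests its dependency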

The implementation of the service manager needs to be kept fairly simple, since it acts as a local single point of failure. This conflicts with the need to use configuration files, make resource checks and provide a user interface, so these features should be based on external components or libraries that have been proven in use.

Consequences

+: Detects and initiates recovery from transient faults that cause a process to consume too much system resources or become unresponsive.

+: Ensures other processes stay alive and have sufficient resources.

+: Simplifies the starting procedure of a complex system that consists of a large number of processes, making it possible to start and stop a large number of processes automatically and in a specific order.

+: Cost-efficiency: the same service manager implementation can be reused on several systems.

+: Supports logging and reporting of errors so that they do not go undetected.

−: Cannot detect faults that cause erroneous output from monitored components.

−: Cannot recover from persistent faults such as development and physical faults, e.g. computer failures.

−: Potential single point of failure that may stop the entire system from working if services are incorrectly terminated.

−: Restarting a service may cause the system to behave in a non-deterministic way and miss deadlines, which is a failure for a hard real-time system. However, it should be noted that the failure would likely have caused the system to miss deadlines or exhibit some other unwanted behavior even without the service restart.

−: Resource utilization needs to be estimated for the processes in order to set limits.

−: The service manager uses system resources and may reduce performance.

Known uses.

Node State Manager (NSM) for in-vehicle infotainment systems: GENIVI Alliance (http://genivi.org/) is a non-profit consortium promoting an open-source platform for the automotive in-vehicle infotainment industry. The reference implementation of the platform includes the NSM, which is responsible for information regarding the current running state of the embedded system. The NSM component collates information from multiple sources and uses this to determine the current state of the node. It is the highest level of escalation on the node and will therefore command the reset and supply control logic. It is notified of errors and other status signals from components that are responsible for monitoring system health in different ways. The NSM also provides shutdown management by signaling applications to shut down.

MINIX 3.0 driver manager: MINIX is a POSIX-conformant operating system based on a microkernel that has a minimal amount of software executing in kernel mode. Most of the operating system runs in user mode as independent processes, including processes for the file system, process manager, and device drivers. The system uses a special component known as the driver manager to monitor and control all services and drivers in the system [5]. The driver manager is the parent process of all components, so it can detect their crashes (based on POSIX signals). Additionally, the driver manager can check the status of selected drivers periodically using Heartbeat messages. When a failure is detected, the driver manager automatically replaces the malfunctioning component with a fresh copy, without needing to reboot the computer. The driver manager can also be explicitly instructed to replace a malfunctioning component with a new one.

Monit (http://mmonit.com/monit/) is an open source tool that can function as a service manager in non-real-time systems. For example, a Monit configuration can be set to restart the Spamassassin daemon if its memory or CPU usage exceeds 50% for 5 monitoring cycles.


Related Patterns.

Fault observer [6], Heartbeat, Safe state, Someone in charge [6], Start-up monitor, Static resource allocation [7], and Watchdog.

To see how to design an application in a way that it can be easily restarted at any time, see Let it crash.

The Manager design pattern [10] can be used to manage multiple objects of the same type – the idea is similar to Service manager (keep track of entities and provide a unified interface for them), but Manager has a different scope, i.e. managing entities (objects) of the same type, and it does not include resource monitoring or fault detection.

Service configurator is very similar to Service manager in many regards. However, the main use cases for Service configurator are, as the name implies, related to reconfiguration of the system, whereas Service manager aims to improve the fault tolerance of the system by managing (monitoring and restarting) services. In CPSs, dynamic reconfiguration of the system can often be undesirable due to possible safety implications. An example of Service configurator is the device driver system in modern operating systems. A comparable implementation of Service manager is the driver manager in MINIX, which adds the management (fault detection and restart) aspect to device drivers.

Service manager can Quarantine a module by stopping it if a fault is detected and recovery does not work.

System monitor [6] can be used to study the behavior of the system or specific tasks and make sure they operate correctly, e.g. by using Heartbeat or Watchdog. If a monitored task stops, System monitor reports the error. Compared to it, Service manager has a more active role in managing the tasks.

Replica manager [15] provides the necessary mechanisms for replica management in systems that use active node replication, i.e. Redundancy, whereas Service manager makes no presumptions about the use of redundancy.

3.3 Let It Crash


Also Known as.

Crash-only [11], Fail-fast, Let it fail or Offensive programming.

Intent.

Avoid complex error handling for unspecified errors. Instead, crash the process and leave error handling for other processes in order to build a robust system that handles errors internally and does not go down as a whole.

Context.

You are developing a distributed control system that consists of several processes and subsystems that need to cooperate to complete tasks.

Data-centric architecture or some other asynchronous decoupled architectural design has been utilized so that processes are not using shared memory.

Some subsystems might have safety-critical functionality, but it is possible to move the system to a Safe state (i.e. the system is of the fail-safe type, not fail-operate). The system holds dynamic state information from the user inputs and the working environment in process memory, e.g. tool tracking data in the case of a robot manipulator. This state data needs to be recovered after a failure.

The system has a mechanism to supervise and restart the processes. This can be implemented at operating system, programming language or framework level, e.g. with the Service manager.

Problem.

How to implement a lightweight form of error handling that improves reliability and predictability?

Forces

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures, since degraded functionality is better than no functionality. In case of a fault, only minimal part of the system should be affected. Recovery from failures should happen without human intervention and with minimal downtime.

  • Reliability: Generation of incorrect outputs should be prevented, otherwise errors may propagate and the system could cause damage to the environment.

  • Safety: If an error is detected, any functionality using the affected process should be stopped and taken to a safe state in order to prevent and minimize damages.

  • Cost-efficiency: Design diverse fault tolerance techniques are oversized or impractical for the application, but the system needs to be able to recover from errors.

  • Real-time performance: Control system needs to react within a certain time-limit; exceeding the time-limit causes a failure.

  • Predictability: The system should behave in a consistent manner. If the process tries to repair its corrupted state, behavior of the system cannot be predicted, which complicates debugging and verification of reliability. Predictability includes system behavior when a fault is triggered.

  • Recovery: Because it is impossible to foresee all possible faults, specifications do not cover all possible error situations. Various error situations occur seldom, are difficult to handle and non-trivial to simulate in testing [11]. If the programmers try to implement recovery, they will make ad hoc decisions not based on the specifications (i.e. they cannot know how the error should be handled), possibly causing unwanted and undocumented behavior.

Solution.

Make processes crash-safe and fast to recover; flush corrupted state by “crashing” the process instead of writing extensive error handling code.

The Commodore 64, DOS machines and other old computers were designed to be shut down by simply turning the power off, essentially crashing the system. On the other hand, if an operating system caches disk data in memory, a workstation crash may corrupt the file system, which is inconvenient and slow to repair. Control system processes and subsystems should likewise be designed to be easily terminated and recoverable with a simple recovery path if an error is detected, instead of guessing how error recovery should be attempted, possibly corrupting the program state further and causing unpredictable behavior.

Therefore, implement error handling by terminating the process that has encountered the error. Only program extended error recovery routines if they are based on the specification or it is self-evident how the error should be handled – otherwise crash the process. However, only the module or process where the error occurred should be crashed, not the whole system.

Processes that have been designed with Let it crash can (1) help to find faults by making them more visible (“offensive programming”), (2) prevent software degradation with Rejuvenation [11, 14], and (3) be used to implement fault tolerance (recovery from faults). In the last case it is possible to perform recovery without affecting service availability if the recovery process is fast enough. Recovery (and rejuvenation) needs an external entity to initiate the procedure, since the process itself has crashed (see Fig. 4). This pattern focuses mostly on the last case since it is more problematic to implement correctly.

Fig. 4. Process 1 encounters an error and dies, after which it is restarted by the service manager, represented as an eye. If process 2 detects a deadline overrun, it needs to stop, potentially interrupting process 3, and wait until process 1 is active again before resuming work. Alternatively, process 2 does not notice any deadline overruns and continues working normally.

You have a monitoring layer that can supervise and recover processes, e.g. by restarting them. For the monitoring layer to detect a failure, you may need to implement timeouts, or the faulty process must terminate upon encountering an error in order to send a signal to the monitoring layer (a parent process knows the liveliness state of its child processes). How the error is detected in the first place is not part of Let it crash, but contract programming or error checks could be used. Abnormal program termination can be forced e.g. by using abort() or raise(SIGSEGV). If the monitoring layer has implemented failure detection – based on a watchdog, heartbeat, etc. – it can also hard-fail the service using e.g. kill(pid, SIGTERM). This might be necessary if the process is incapable of detecting its own fault.
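The following sketch illustrates this division of labor under stated assumptions: expected error situations are still handled in place, whereas a violated invariant makes the process terminate abnormally so that the monitoring layer can restart it. The invariant, values and control law are hypothetical:

# Sketch of the Let it crash idiom at process level; all values are hypothetical.
import os
import random


def read_sensor() -> float:
    """Stand-in for a real sensor read; occasionally returns garbage."""
    return random.choice([0.8, 1.2, float("nan")])


def control_step(measurement: float) -> float:
    # Expected, specified error situations are handled in place ...
    if measurement != measurement:        # NaN is the only value not equal to itself
        return 0.0                        # specified handling: safe output
    return 0.5 * (1.0 - measurement)      # hypothetical control law


def main() -> None:
    for _ in range(1000):
        command = control_step(read_sensor())
        # ... but an "impossible" internal state is not repaired ad hoc: terminate
        # abnormally so the monitoring layer notices the crash (SIGABRT) and
        # restarts the process along the normal recovery path.
        if abs(command) > 100.0:          # invariant that should never be violated
            os.abort()


if __name__ == "__main__":
    main()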

Error recovery is performed by restarting the process. Therefore, make processes fast and easy to restart in order to minimize service failures and downtime. To keep the recovery path simple, use the single responsibility principle, thereby minimizing the responsibilities of a single process. If a process encounters an error and crashes, it might be possible to recover from the error without causing deadline misses for other processes and tripping the system into a Safe state. However, if a control loop has a period of e.g. 1 ms and restarting a process that provides information for the loop takes several milliseconds, the control loop execution will be interrupted.

Let it crash does not mean that error handling or exception handling should not be implemented at all. Indeed, sanity checks and error handling are essential for control systems and should be implemented to prepare for exceptional (but expected) circumstances, such as write operation failures or unavailable dependencies. Let it crash, on the other hand, is applicable in situations where the program experiences an unexpected failure and cannot reliably perform its function. This can happen due to programmer errors, complex interaction faults, intermittent faults, etc.

Recovery paths can be tested extensively by terminating the system forcibly every time it needs to be shut down or restarted, instead of letting it run through a normal shutdown process. This forces the system to do a recovery during the startup.

Make processes crash-safe. Processes typically handle three types of state data: dynamic, static, and internal. Internal state is related to current computations and is usually discarded after use. If a process crashes, you must consider whether you want to recycle its internal state: if you recycle everything, you risk hitting the exact same fault again and crashing, so it might be reasonable to recycle only parts of this state. Static state is configuration data that can be easily recovered or read from other processes. Finally, dynamic state data is generated as the program executes, by reading user inputs, interacting with other processes and the environment, etc. Some of it can be computed from other data or read directly from sensors, but the rest cannot be reconstructed. This data must be protected by using checkpointing, journaling or some other form of dedicated state store, for example databases or distributed data structures. To implement this, you must know What to save [6].
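A minimal checkpointing sketch for such irreplaceable dynamic state is shown below, assuming a simple file-based store; the file name and state contents are hypothetical:

# Checkpointing sketch for dynamic state that cannot be reconstructed after a crash.
import json
import os

CHECKPOINT = "tool_tracking.ckpt"         # hypothetical checkpoint file


def load_state() -> dict:
    """On (re)start, recover the last checkpointed dynamic state if it exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"attached_tool": None, "samples_processed": 0}    # fresh start


def save_state(state: dict) -> None:
    """Write atomically so that a crash mid-write never corrupts the checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT)           # atomic rename on POSIX


state = load_state()
state["samples_processed"] += 1           # dynamic state updated during operation
save_state(state)                         # checkpoint at suitable points

Writing to a temporary file and renaming it keeps the recovery path simple: after a crash, the restarted process finds either the previous complete checkpoint or a fresh default, never a partially written one.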

Implement a reporting functionality that reports failures so that they do not go unnoticed. Failure information can be forwarded e.g. by using a service manager or supervisors to send Notification messages [12].

The corollary to the Let it crash approach is that you must design your software to be ready for processes failing. There is now a possibility that a dependency is not available because it has crashed and is being restarted. To detect this situation, add timeouts or appropriate QoS policies to interactions between components. If a timeout is triggered, move the system to a Safe state. Normal operation can be resumed when the dependencies are back online. A missing dependency is therefore not considered to be an error that would necessitate a crash.
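A sketch of this behavior, with hypothetical timing values and stand-in functions, could look like the following:

# Tolerating a crashed dependency: stale data triggers Safe state, not a crash.
import time

DEADLINE_S = 0.05                  # hypothetical maximum age of position data
last_update = time.monotonic()     # refreshed by the subscriber callback


def on_position_sample(sample) -> None:
    """Called by the middleware whenever the dependency publishes new data."""
    global last_update
    last_update = time.monotonic()


def enter_safe_state() -> None:
    print("position data stale, holding safe state")    # stop motion only


def drive_actuators() -> None:
    pass                           # stand-in for the normal control output


def control_cycle() -> None:
    age = time.monotonic() - last_update
    if age > DEADLINE_S:
        enter_safe_state()         # the missing dependency is not treated as an error
        return                     # normal operation resumes once data flows again
    drive_actuators()


control_cycle()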

Consequences

+: Enables simple error handling and recovery; avoids complex error handling constructs in code, therefore improving the predictability of the system.

+: Cost-effective (lightweight) form of fault tolerance that does not require the use of redundancy.

+: Allows error handling to be implemented separately (externally) from the business logic, e.g. with supervisors.

+: Supports recovery from transient faults, since a restart is usually enough to handle them.

+: Possible to achieve high availability (for the system as a whole, not necessarily for all services provided by the system).

+: Complements other fault tolerant designs such as Redundancy and Rejuvenation.

+: Processes can be updated to new versions on-the-fly, since the old process can be killed and replaced using the normal recovery path.

+: Limits error propagation to other parts of the system (babbling idiot failure) by acting as an Error containment barrier [6].

+: Errors are less likely to cause the system to perform unpredictable and potentially dangerous or irreversible operations.

+: Finding faults should be easier, since they are made more visible by crashing and reporting.

−: Availability of some services provided by the system is lower (compared to redundant fault tolerance solutions) – on the other hand, the availability of other, unrelated services provided by the system should be unaffected.

−: Cannot mitigate persistent faults.

−: Processes need additional code to react to missing dependencies (i.e. other services, when waiting for them to come back online).

−: Possible performance cost if state needs to be saved to enable recovery.

−: Recovery speed is non-deterministic, since it depends on how fast the processes can be restarted, the loading of saved state, the loading of dependencies, the system load level, etc.

Known uses.

Erlang actor model and supervisors (Erlang is used e.g. in Ericsson AXD301 ATM switches) [2]: supervisors are processes that are responsible for starting, stopping and monitoring their child processes. The basic idea of a supervisor is that it should keep its child processes alive by restarting them when necessary [13].

Control system of Curiosity: Mars rovers are highly autonomous vehicles that operate in a high-radiation environment, relying on a low-bandwidth, high-latency communication link. A warm reset can be executed by the control system when it identifies a problem with one of its operations. On November 7, 2013, the Curiosity rover performed a reset of its control software upon encountering an unexpected event (an error in a catalog file) [17]. After the reset, the rover entered safe mode but was able to perform operations and communications as expected, and it successfully resumed nominal operations mode after the fault had been analyzed.

Related Patterns.

Error containment barrier, Notifications, Safe state, Service manager, Redundancy, What to save.

Minimize human intervention (MHI) is about how the system can process and resolve errors automatically before they become failures [6]. Let it crash could be implemented as part of MHI as a final resort, or in case there is no specification for error handling.

Software Rejuvenation is a proactive technique where the system has been designed to be rebooted periodically. Microrebooting [11] refers to a technique where suspect components are restarted before they fail.