
1 Introduction

Distributed control systems are continuously gaining importance, as more and more devices and machines are equipped with embedded systems that control their operation. Computers in these control systems are increasingly powerful and networked, providing intelligence and interoperability. Examples of such systems range from large mobile machines to groups of robots and intelligent sensor networks. These cyber-physical systems (CPSs) interact with the environment and physical processes, influencing many parts of our lives either directly or indirectly. Therefore they need to be dependable, which can be measured with the attributes of availability, reliability, safety, integrity and maintainability [1]. However, with the increased functionality and intelligence, the complexity of these systems also increases, meaning that the development process becomes more demanding and dependability becomes more costly to achieve and verify. Another significant feature of CPSs is that they often have strict timing constraints, which may put limitations on the architecture.

Many critical systems that have failed catastrophically are well known – examples such as the Therac-25 radiation therapy machine and the explosion of the Ariane 5 rocket are infamous – whereas highly reliable systems receive little recognition, even though their study might give valuable ideas for the design and architecture of new software. One example of such a system can be found in telephony applications, namely the Ericsson AXD301 Asynchronous Transfer Mode (ATM) switch that achieved nine nines (99.9999999%) service availability, running software written in Erlang [2]. Erlang’s highly decoupled actor model and supervisor-based fault handling have especially inspired the Let it crash and Service manager patterns found in this paper.

This paper presents three software patterns that can be used to improve control system dependability – Data-centric architecture, Service manager and Let it crash – and shows how they fit in the existing literature by addressing the specific needs of CPSs. The approach promoted by these three patterns is based on implementing a decoupled architectural design with supporting fault mitigation and handling. The decoupled architecture can also be used to gradually introduce additional fault tolerance solutions such as checkpointing and rejuvenation to the system, until a sufficient level of reliability has been achieved [3]. Our patterns were originally encountered in the research of remote handling control systems for robotic manipulators, but all patterns have examples of other known uses as well. These examples are presented in the corresponding sections of the patterns.

One reason why development of CPSs is difficult is that the systems typically consist of dynamic service chains that operate on a wide range of platforms, which complicates management of end-to-end deadlines. Moreover, modern middleware provides capabilities to flexibly change service deployment on these subsystems, but some configurations may be inefficient or even unusable if communication links become overloaded. While adaptability has benefits, these uncertainties nevertheless complicate assurance of the reliability and predictability of the system. Therefore, CPSs benefit from a design that makes the overall system more robust, whereas more traditional fault tolerance solutions, such as hardware redundancy, are arguably better suited for static safety-critical subsystems.

A data-centric approach is one way to increase decoupling between communicating units. However, data-centric design as a central communication paradigm, as well as the concept of CPS, is still fairly novel in the domain of distributed control systems. Although control systems are by nature data-centric (read sensor data and desired output, send actuator commands, etc.), this communication has usually been point-to-point, from point A to point B. The patterns in this paper capture some of the ways in which reliability-related challenges faced in developing more intelligent and adaptable distributed control systems have been solved. The next section shows how our patterns fill gaps in the existing pattern literature by addressing needs specific to CPSs.

2 Context of the Patterns

Fault tolerance cannot be implemented without redundancy of some kind. To have fault tolerance for e.g. computer failures, we would need at least two computers – if one fails, the other can detect the error and try to correct it. Software faults, on the other hand, are typically development faults, which are harder to detect and correct than hardware faults. To have good coverage for software faults, diverse redundancy (e.g. N-version programming) is needed, but it has been criticized for being susceptible to common mode failures [4]. Moreover, development costs for design diversity are often seen as prohibitive.

The patterns in this paper present an alternative approach to fault tolerance, based on dividing the system into highly decoupled modules and implementing a lightweight form of fault tolerance. We present an architectural pattern called Data-centric architecture as one way to achieve a high level of decoupling. One of the key points of decoupling is that it should by itself improve reliability by limiting fault propagation and improving the modularity and understandability of the system. In a way, the modular approach is similar to the compartmentalization of ships – without compartments, every leak can sink the ship. An example of a software system that uses modularity to successfully implement fault isolation and resilience is the MINIX 3 operating system released in 2005 [5]. The driver management of MINIX 3 is presented as one of the known uses of Service manager.

A modular and decoupled architecture can also be used to implement other reliability-improving patterns such as Service manager and Let it crash documented in this paper, or other well-known patterns like Leaky bucket counter [6], Watchdog [6, 7], etc. Short descriptions of the patterns presented in this paper are listed in Table 1. A list of all referenced patterns with descriptions can be found in the appendix.

Table 1. Pattern descriptions

Data-centric architecture provides the decoupled architectural model needed to use Let it crash for fault handling. The Service manager pattern provides a way to attempt recovery after failures, in addition to providing error detection and monitoring. The idea of crashing a process, as suggested by Let it crash, may sound like a risky action to take. However, the idea is to offer recovery from transient physical and interaction faults (sometimes called Heisenbugs), to keep the system as a whole functioning even if some internal process crashes, and to make it possible to hot-swap code and bug fixes. The downside of this approach is of course that it is not suited for fail-operate systems such as flight controllers that must be operational all the time – such systems would be the right domain for design diversity.

In order to show how these patterns fit the existing literature, we have built a pattern language for fault tolerance in CPSs that references related patterns and pattern languages, shown in Fig. 1. The entry point to the language is the need to introduce fault tolerance to the system in order to improve its dependability. The three main starting points are Minimize human intervention [6], Redundancy [6] and Units of mitigation [6], but the Redundancy branch has not been explored in depth since it presents a somewhat different approach from the three patterns found in this paper. Recovery types have also been condensed into a single concept. Some of the connections presented in the original sources have been reorganized in order to better fit this context, and the figure shows only one of the possible combinations of the patterns. Connections to other patterns and pattern languages can be checked from the references in Table 2 found in the appendix.

Fig. 1. Pattern language for fault tolerance in cyber-physical systems

The pattern language shows how the patterns presented in this paper build on top of existing patterns and support implementing fault recovery and Safe state [7] in CPSs. The gaps identified in the pattern language are related to CPSs being networked systems with real-time requirements and safety concerns. Fault handling needs extra attention since a control system cannot try complex fault recovery routines that could have unforeseen consequences. Instead, a better approach is to Quarantine [6] the faults locally and stop their propagation, even if that means losing some functionality either temporarily or permanently.

There are several existing patterns that have a similar purpose as Service manager, such as Fault observer [6], Replica manager [15], Service configurator [16] and System monitor [6]. However, CPSs benefit from a more active management component that can react to failures within system specifications – because they typically have timing-critical control loops and state machines – in order to mitigate faults and stop their propagation in the system.

Finally, to implement the fault handling, units need a loosely coupled architecture that is robust to failures and supports fault detection. The patterns in the pattern language work together by building on top of features provided by other patterns, as shown in Fig. 1, but all of the patterns can also be used in other contexts besides distributed control systems. Other well-known fault tolerance patterns also work well in combination with the presented patterns. Besides the patterns presented here, other typical reliability-related concerns in CPSs include fault detection, fault reporting, and the sending and acknowledgement of commands; these have been left out of this paper.

3 Patterns

3.1 Data-Centric Architecture


Intent.

Implement an architecture based on autonomous modules (e.g. services, processes or applications) that communicate by sharing properly modeled data.

Context.

You are developing a distributed control system that consists of several subsystems and needs to interact with other heterogeneous systems such as mobile machines or plant systems. The system has CPU and memory resources available to run an operating system – rather than being based on a basic time-triggered scheduler used in resource-constrained embedded systems. Failures in control functions (e.g. boom or manipulator control) may cause damage to the environment and equipment, meaning that some subsystems may be categorized as safety or mission-critical.

Problem.

How to implement a reliable and scalable distributed control system?

Forces

  • Throughput: Some time-critical data, such as sensor measurements, may be updated with a short period, producing large amounts of communication.

  • Scalability: New nodes and subsystems can join the system at any time; assumptions about interfaces between modules should be minimized.

  • Changeability: System configuration and functionality might change. Changing interfaces in a tightly coupled system requires code changes at both ends (and at all clients), so assumptions about expected behavior should be minimized. Point-to-point protocol based client-server architectures (e.g. sockets or remote method invocation) are not ideal because of the complexity and coupling they introduce.

  • Maintainability and long expected life-cycle: The control system has a long expected lifetime and needs to be maintainable and extensible in the future – if subsystems are added or substituted, changes to existing modules need to be minimized. The system should be easy to understand and modify without breaking it.

  • Maintainability: Implementing custom communication channels and protocols should be avoided.

  • Reusability: Same modules could be used in other control system implementations.

  • Interoperability: Distributed control systems consist of and/or need to communicate with heterogeneous platforms.

  • Testability: Tightly coupled modules are difficult to test because they are more dependent on other modules.

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures.

  • Reliability: A single fault in the control system software should not endanger the functionality of the whole system (i.e. no single points of failure).

  • Reliability: Faults should be detected and their propagation prevented.

  • Real-time performance: Control system interacts with the real world and needs to react in a deterministic manner.

  • Safety: The system needs to detect if a module has crashed or is down (not releasing new information) so that it can enter a Safe state in a controlled fashion. Safety-critical and non-safety-critical subsystems cannot be tightly coupled, since errors may propagate.

  • Quality of service: Different subsystems may have different requirements for quality of service (QoS) policies. There is an impedance mismatch between e.g. real-time control systems that operate on a timescale of milliseconds and enterprise/high-level systems that are several orders of magnitude slower.

Solution.

Build the system from autonomous modules that communicate by sharing data that is based on a well-designed and consistent data model.

Implement communication between modules as sharing of data, instead of sending point-to-point messages or request-reply service calls. The data-centric approach is based on minimizing dependencies between modules by removing direct inter-module references and hiding module-specific behavior. This can be achieved by delegating data handling to a middleware solution that supports publishing data to topics in a distributed data space, and by making applications tolerate the unavailability of their dependencies. Asynchronous messaging is a well-known way to reduce the coupling of systems, but the data-centric approach goes further by removing the concept of a recipient from the publisher.

Modules should be built to be autonomous and should not expect that other services are always started in a specific order and available. Service/module composition may change during runtime; there are patterns for managing the configurations (e.g. Service configurator). Developers should avoid assumptions about the state of dependencies, i.e. other services. Dependencies may not always be available, and this must be taken into account in the application code so that the service reacts accordingly if a dependency is down because it is in the process of starting, has failed, has been manually shut down, etc.

Management of the global data space is externalized to the middleware, which implements a topic-based Publish/subscribe model. The middleware disseminates data to all participating nodes and acts as a single source of up-to-date system-wide state information, instead of applications managing state separately.

Modules do not need to know the recipients of the data when publishing it, which reduces coupling. Instead of sending data directly to a recipient, it is published to a topic. The data can be e.g. sensor measurements, events or commands, but it must follow a shared information model, which is represented as topics in the actual system implementation. Publishers register as data writers to a topic, and interested subscribers can join the topic as data readers. A single topic can have multiple instances, which are identified by a key value, and can have multiple readers and writers, as shown in Fig. 2.

Fig. 2. Data is published to topics that can have multiple data writers and readers. Topic A has two instances, identified by the id number key value.
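The following minimal, in-process sketch illustrates this arrangement of keyed topics, data writers and data readers; the Topic and WheelSpeed names are hypothetical stand-ins, not a real middleware API (an actual system would use e.g. DDS):

# Illustrative sketch of keyed, topic-based publish/subscribe; not a real middleware API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WheelSpeed:              # hypothetical topic data type
    wheel_id: int              # key field: identifies the topic instance
    rpm: float


class Topic:
    """Keeps the latest sample per instance and notifies all data readers."""

    def __init__(self, name: str, key: str) -> None:
        self.name = name
        self.key = key
        self.instances: Dict[object, object] = {}        # key value -> latest sample
        self.readers: List[Callable[[object], None]] = []

    def write(self, sample: object) -> None:
        """Called by data writers; the publisher never names a recipient."""
        self.instances[getattr(sample, self.key)] = sample
        for reader in self.readers:
            reader(sample)

    def subscribe(self, reader: Callable[[object], None]) -> None:
        """Late joiners immediately receive the current state of every instance."""
        self.readers.append(reader)
        for sample in self.instances.values():
            reader(sample)


wheel_speed = Topic("WheelSpeed", key="wheel_id")
wheel_speed.write(WheelSpeed(wheel_id=1, rpm=120.0))     # instance 1
wheel_speed.subscribe(lambda s: print(f"wheel {s.wheel_id}: {s.rpm} rpm"))
wheel_speed.write(WheelSpeed(wheel_id=2, rpm=118.5))     # instance 2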

Since the middleware decouples the modules, a publisher might assume that a subscriber is listening when it is not. If a publisher needs to know that data has been received, it should monitor the status of the subscriber (published to another topic). This might be the case, for example, with command sequences where a command must be completed before the next one is sent.

Instead of designing callable methods for components, you must design how to represent the state of the system and the external or internal events that can affect it. This is captured in a common data model, which contains the essential elements of the physical system and application logic. Conceptually the data model is similar to a class diagram in object-oriented programming, since it consists of identifying entity types, which have data attributes assigned to them, and associations. The difference is that the data model focuses on data instead of behavior. The data model ensures that communication between modules is unambiguous and interoperable. Appropriate QoS attributes can also be attached to the data model.
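As an illustration, a fragment of such a data model might be sketched as follows; the entity types, attributes and QoS values are hypothetical and only show the idea of keyed entity types with associations and attached QoS policies:

# Sketch of a shared data model module; all names and values are hypothetical.
from dataclasses import dataclass
from enum import Enum


class Reliability(Enum):
    BEST_EFFORT = 1    # newest value wins, losses tolerated (e.g. sensor streams)
    RELIABLE = 2       # delivery retried until acknowledged (e.g. commands, events)


@dataclass(frozen=True)
class Qos:
    reliability: Reliability
    deadline_ms: int            # maximum allowed period between samples


@dataclass
class JointState:               # entity type: state of one manipulator joint
    joint_id: int               # key attribute, identifies the topic instance
    angle_rad: float
    velocity_rad_s: float


@dataclass
class MoveCommand:              # entity type: an event that affects the system state
    command_id: int             # key attribute
    joint_id: int               # association to a JointState instance
    target_angle_rad: float


# QoS policies attached to the data model rather than scattered in application code.
TOPIC_QOS = {
    "JointState": Qos(Reliability.BEST_EFFORT, deadline_ms=10),
    "MoveCommand": Qos(Reliability.RELIABLE, deadline_ms=500),
}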

Communication and application logic are separated, since network communications are delegated to a “data bus” formed by the publish/subscribe middleware (Fig. 3), so that the application logic can focus on the core functionality. The middleware takes care of keeping the data up to date, automatically updating new nodes that join. If the middleware uses a central server as a message Broker [8], it becomes a single point of failure and possibly a bottleneck. Therefore, choose a decentralized middleware solution, if possible, to avoid this problem.

Fig. 3. Middleware implementation as a virtual data bus that has no central components or brokers. Services and subsystems can join topics as publishers and/or subscribers.

The granularity of modules and interactions is an important design decision that affects failure consequences, performance and reusability of modules. Fine-grained autonomous modules (a large number of smaller modules) are easier to reuse and make it easier to isolate faults, but limiting the number of modules and interactions helps to avoid potential performance issues. Modules communicating only locally can be more fine-grained than ones communicating remotely, although the data model should not include location dependencies. Fine-grained interactions give more flexibility, as it is possible to treat data items separately. Coarse-grained interactions are usually preferred between remote modules in order to avoid overhead, but data that is updated rapidly should be separated from data with slow update rates in order to avoid unnecessary use of bandwidth. Further control over system granularity can be achieved by dividing the system into domains.

Compared to message-centric publish/subscribe, one of the differences in the data-centric model is that the middleware understands the data samples published to topics. In the message-centric model, the middleware does not know or care about message contents, and communication is point-to-point by nature, which introduces coupling between modules, although some message-centric middleware also support publishing messages to topics. Data-centric communication is based on a data model that expresses the state of the system. Since data is interpreted through the model, it is platform-independent, and the middleware can prioritize, filter and manage the data based on its contents and QoS policies, replacing part of the application logic. Although developing a data model adds to upfront planning efforts, systems with long life-cycles benefit in terms of maintainability and evolvability.

Consequences

+: Publishers do not need to know about subscribers.

+: Interoperability between heterogeneous platforms, since data is interpreted through the data model.

+: Decoupled design provides error confinement and other benefits such as improved maintainability.

+: Modules can be changed dynamically because late joiners receive new data automatically; the ability to hot-swap code can be easily implemented.

+: An application or subsystem can be shut down without impacting the overall operation of the system.

+: The network transport layer is abstracted as communications are externalized to the middleware, which reduces communication-related code and simplifies implementation.

+: Gives developers control of data delivery with QoS management; QoS can be used e.g. to guarantee reliable delivery (eventually) or to keep the available data up to date with best effort. The former is useful for sending status changes or commands, whereas the latter could be used for sensor measurements for which guaranteeing delivery of outdated samples makes no sense.

+: Reusability is improved since modules are not using shared memory, have their own namespaces, etc.

+: Publish/subscribe based middleware scales effectively since recipients for data are not explicitly defined.

+: Performance gains can be achieved on multi-core machines since modules can be easily parallelized and they communicate asynchronously.

±: Needs good and consistent data models that must be managed and maintained, but a well-thought-out data model improves maintainability and makes reuse of the code easier.

−: A publisher might assume that a subscriber is listening when it is not.

−: Sending commands is not as straightforward as in client-server architectures, since commands need to be parsed from the data. However, interactions can be modeled as operation codes sent between two modules.

−: Parsing of data complicates debugging because it adds another potential source of faults. If data is parsed incorrectly, the origin of the fault may not be self-evident.

−: Extra code is needed compared to more monolithic applications, since modules cannot presume that all dependencies are started in a specific order and available all the time.

−: Serialization and deserialization of the data structures for transmission may add overhead.

−: Faults in the middleware itself complicate testing and are hard to detect.

−: Middleware solutions add some overhead to message size and use system resources.

−: Possible vendor lock-in to the middleware provider.

Known uses.

Data Distribution Service for Real-Time Systems (DDS) is a decentralized and data-centric middleware based on the publish/subscribe model. DDS is aimed at mission-critical and embedded systems that have strict performance and reliability requirements; therefore, its implementations have typically been optimized and tested to suit the needs of these systems. DDS is used as the information backbone in the Thales TACTICOS naval combat management system, which integrates various subsystems such as weapons, sensors, countermeasures, communication and navigation into a “system of systems”. Applications are distributed dynamically over a pool of computers in order to provide combat survivability and avoid single points of failure. The system configuration can be adapted for use in various mission configurations, on-board and simulator training, and different ship types.

Related Patterns.

Bus abstraction [7] and Publisher-subscriber.

Mediator [9] increases decoupling in a similar fashion, but is designed to decrease connections between objects locally.

Decoupled modules in Data-centric architecture act as Units of mitigation, parts that contain errors and error recovery.

3.2 Service Manager


Also Known as.

Supervisor.

Intent.

Service manager starts, stops, and monitors processes locally and takes care of resource allocation for systems that need high availability and real-time performance.

Context.

You are developing a system with a highly decoupled architecture (e.g. using Data-centric architecture) that consists of a large number of processes or tasks (services). These processes have dependencies and therefore need to be started in a specific order. The process composition may change dynamically during runtime because your system has intelligent functionality, needs to adapt to new situations, or different functionalities need to be tested without stopping and restarting the whole system.

You know rough upper-limit estimates for how much system resources such as memory and CPU time the processes will use.

The system has a long expected life-cycle. It is likely to be deployed in a remote location, for example a forest or a control cubicle, making direct physical interaction with the system a bothersome task.

If you have a real-time operating system and a task gets stuck in a while loop or some other control structure, it freezes the whole system, as other lower-priority processes (including input devices and network connections) cannot get CPU time. In this case, the only option is usually to restart the whole computer manually.

Problem.

How to ensure that all dynamic modules in your control system are running correctly and you have enough system resources to achieve deterministic real-time performance?

Forces

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures, in order to be able to use the parts of the system that are not connected to the failed subsystem. The system must detect faults and try to mitigate them automatically. If a failure needs immediate reaction from a human operator, the system will not scale cost-efficiently and reliably.

  • Data logging/testability: If a process fails, the failure should be detected and logged.

  • Real-time performance: The control system needs to respond in a deterministic and predictable manner. Predictability includes system behavior when a fault is triggered.

  • System resources: Control systems are typically deployed on embedded devices that have limited memory and CPU resources available. These resources may need to be monitored in order to guarantee the real-time performance of the system.

Solution.

Implement a service manager that can monitor, start and stop local modules.

Create a local parent process (the service manager) that is responsible for starting, stopping and monitoring its child processes. The basic idea of the service manager is to keep its child processes alive by restarting them when necessary. The service manager is located on the same computer as the child processes in order to keep the implementation simple; therefore, all computers in the system need their own, independently functioning, service managers. The service manager is given the highest process priority in the system or is put in the kernel so that a faulty real-time process cannot prevent it from functioning by consuming all available CPU time.

Start the child processes based on a fixed order or a dependency table read from a configuration file, similar to Start-up monitor [7], and/or implement a user interface that can be used to start and stop processes.

Use the service manager to allocate resources such as CPU time and memory for the child processes and to monitor their use. Expected maximum resource consumption can be specified in the same configuration file that is used for starting services. New processes are not started if there are not enough resources available. If a process consumes more resources than expected, it can be restarted, triggering error handling according to the Let it crash pattern. Resource use can be followed e.g. with the proc filesystem or the getrusage call in Unix-like systems.
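A rough sketch of such a service manager is given below; the service list, resource limits and monitoring period are hypothetical, the sleep commands stand in for real service binaries, and a production implementation would add logging, process priorities and proper error reporting:

# Illustrative local service manager sketch (Linux, Python); all values are hypothetical.
import subprocess
import time

# Start order encodes dependencies; limits are rough upper-bound estimates.
SERVICES = [
    {"name": "io_server",  "cmd": ["/bin/sleep", "3600"], "max_rss_kib": 50_000},
    {"name": "controller", "cmd": ["/bin/sleep", "3600"], "max_rss_kib": 200_000},
]


def rss_kib(pid: int) -> int:
    """Resident memory of a process in KiB, read from the proc filesystem."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass                                   # process exited between checks
    return 0


def start(service: dict) -> subprocess.Popen:
    print(f"starting {service['name']}")
    return subprocess.Popen(service["cmd"])


processes = {s["name"]: start(s) for s in SERVICES}    # fixed start order

while True:
    for service in SERVICES:
        proc = processes[service["name"]]
        crashed = proc.poll() is not None              # child has terminated
        over_limit = not crashed and rss_kib(proc.pid) > service["max_rss_kib"]
        if over_limit:
            proc.terminate()                           # trigger Let it crash handling
            proc.wait()
        if crashed or over_limit:
            print(f"restarting {service['name']}")
            processes[service["name"]] = start(service)
    time.sleep(1.0)                                    # monitoring period

Here a child is restarted both when it terminates unexpectedly and when its resident memory exceeds the configured limit, which hands the fault over to the Let it crash handling described later in this paper.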

Since one of the key functionalities of the service manager is to monitor processes for failures, error detection can be based on additional or alternative techniques besides resource monitoring, e.g. operating system features, Heartbeat [6, 7] or Watchdog.

If fault recovery fails, the service manager should mitigate the fault by Quarantining the faulty module. If the fault is persistent, a Leaky bucket counter can be used to limit the number of restarts.
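A possible sketch of such a counter is shown below; the limit and leak period are hypothetical values that would be tuned per system:

# Sketch of a Leaky bucket counter used to decide between restart and quarantine.
import time


class LeakyBucketCounter:
    """Allows occasional restarts but quarantines a module that keeps failing."""

    def __init__(self, limit: int = 5, leak_period_s: float = 60.0) -> None:
        self.limit = limit
        self.leak_period_s = leak_period_s
        self.level = 0
        self.last_leak = time.monotonic()

    def record_failure(self) -> bool:
        """Returns True if restarting is still allowed, False if the module
        should be quarantined instead."""
        now = time.monotonic()
        leaked = int((now - self.last_leak) / self.leak_period_s)
        if leaked:                                   # forget old failures over time
            self.level = max(0, self.level - leaked)
            self.last_leak = now
        self.level += 1
        return self.level <= self.limit

The service manager would call record_failure() whenever a service crashes; once the bucket fills up within the leak period, the module is Quarantined instead of restarted.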

If the service manager is deployed on a system that uses Data-centric architecture, service startup interfaces can be implemented through the middleware. Since the middleware abstracts the location of the data, it can be used to remotely start dependencies. For example, service manager SM_A must start a service called S1. However, S1 has a dependency called S2 which cannot be found locally, so the service manager publishes a start request for S2. A second service manager SM_B on another computer notices the request, starts S2 and publishes information about the successful startup. SM_A receives the information that S2 is available and starts S1.
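This exchange could be sketched as follows; the in-process publish/subscribe helpers, topic names and service names are hypothetical stand-ins for the real middleware:

# Self-contained sketch of the start-request exchange between two service managers.
from collections import defaultdict

subscribers = defaultdict(list)              # topic name -> list of callbacks


def publish(topic: str, message: dict) -> None:
    for callback in subscribers[topic]:
        callback(message)


def subscribe(topic: str, callback) -> None:
    subscribers[topic].append(callback)


# SM_B hosts S2 and reacts to start requests published by other managers.
def sm_b_on_request(msg: dict) -> None:
    if msg["service"] == "S2":
        print("SM_B: starting S2")
        publish("service_status", {"service": "S2", "state": "running"})


# SM_A hosts S1, which depends on S2; it waits for S2 to be reported running.
def sm_a_on_status(msg: dict) -> None:
    if msg["service"] == "S2" and msg["state"] == "running":
        print("SM_A: dependency available, starting S1")


subscribe("service_requests", sm_b_on_request)
subscribe("service_status", sm_a_on_status)
publish("service_requests", {"service": "S2"})    # SM_A requests its dependency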

The implementation of the service manager needs to be kept fairly simple, since it acts as a local single point of failure. This conflicts with the need to use configuration files, make resource checks and provide a user interface, so these features should be based on external components or libraries that have been proven in use.

Consequences

+: Detects and initiates recovery from transient faults that cause a process to consume too much system resources or become unresponsive.

+: Ensures other processes stay alive and have sufficient resources.

+: Simplifies the starting procedure of a complex system that consists of a large number of processes, making it possible to start and stop a large number of processes automatically and in a specific order.

+: Cost-efficiency: the same service manager implementation can be reused on several systems.

+: Supports logging and reporting of errors so that they do not go undetected.

−: Cannot detect faults that cause erroneous output from monitored components.

−: Cannot recover from persistent faults such as development and physical faults, e.g. computer failures.

−: Potential single point of failure that may stop the entire system from working if services are incorrectly terminated.

−: Restarting a service may cause the system to behave in a non-deterministic way and miss deadlines, which is a failure for a hard real-time system. However, it should be noted that the failure would likely have caused the system to miss deadlines or exhibit some other unwanted behavior even without the service restart.

−: Resource utilization needs to be estimated for the processes in order to set limits.

−: The service manager uses system resources and may reduce performance.

Known uses.

Node State Manager (NSM) for in-vehicle infotainment systems: GENIVI Alliance (http://genivi.org/) is a non-profit consortium promoting an open-source platform for the automotive in-vehicle infotainment industry. The reference implementation of the platform includes the NSM, which is responsible for information regarding the current running state of the embedded system. The NSM component collates information from multiple sources and uses this to determine the current state of the node. It is the highest level of escalation on the node and will therefore command the reset and supply control logic. It is notified of errors and other status signals from components that are responsible for monitoring system health in different ways. The NSM also provides shutdown management by signaling applications to shut down.

MINIX 3.0 driver manager: MINIX is a POSIX-conformant operating system based on a microkernel that has a minimal amount of software executing in kernel mode. Most of the operating system runs in user mode as independent processes, including processes for the file system, process manager, and device drivers. The system uses a special component known as the driver manager to monitor and control all services and drivers in the system [5]. The driver manager is the parent process of all components, so it can detect their crashes (based on POSIX signals). Additionally, the driver manager can check the status of selected drivers periodically using Heartbeat messages. When a failure is detected, the driver manager automatically replaces the malfunctioning component with a fresh copy, without needing to reboot the computer. The driver manager can also be explicitly instructed to replace a malfunctioning component with a new one.

Monit (http://mmonit.com/monit/) is an open source tool that can function as a service manager in non-real-time systems. For example, a Monit configuration can be set to restart the Spamassassin daemon if its memory or CPU usage exceeds 50% for 5 monitoring cycles.


Related Patterns.

Fault observer [6], Heartbeat, Safe state, Someone in charge [6], Start-up monitor, Static resource allocation [7], and Watchdog.

To see how to design an application in a way that it can be easily restarted at any time, see Let it crash.

The Manager design pattern [10] can be used to manage multiple objects of the same type – the idea is similar to Service manager (keep track of entities and provide a unified interface for them), but Manager has a different scope, i.e. managing entities (objects) of the same type, and it does not include resource monitoring or fault detection.

Service configurator is very similar to Service manager in many regards. However, the main use cases for Service configurator are, as the name implies, related to reconfiguration of the system, whereas Service manager aims to improve the fault tolerance of the system by managing (monitoring and restarting) services. In CPSs, dynamic reconfiguration of the system can often be undesirable due to possible safety implications. An example of Service configurator is the device driver system in modern operating systems. A comparable implementation of Service manager is the driver manager in MINIX, which adds the management (fault detection and restart) aspect to device drivers.

Service manager can Quarantine a module by stopping it if a fault is detected and recovery does not work.

System monitor [6] can be used to study the behavior of the system or specific tasks and make sure they operate correctly, e.g. by using Heartbeat or Watchdog. If a monitored task stops, System monitor reports the error. Compared to it, Service manager has a more active role in managing the tasks.

Replica manager [15] provides the necessary mechanisms for replica management in systems that use active node replication, i.e. Redundancy, whereas Service manager makes no presumptions about the use of redundancy.

3.3 Let It Crash


Also Known as.

Crash-only [11], Fail-fast, Let it fail or Offensive programming.

Intent.

Avoid complex error handling for unspecified errors. Instead, crash the process and leave error handling for other processes in order to build a robust system that handles errors internally and does not go down as a whole.

Context.

You are developing a distributed control system that consists of several processes and subsystems that need to cooperate to complete tasks.

Data-centric architecture or some other asynchronous decoupled architectural design has been utilized so that processes are not using shared memory.

Some subsystems might have safety-critical functionality, but it is possible to move the system to a Safe state (i.e. the system is of the fail-safe type, not fail-operate). The system holds dynamic state information from the user inputs and the working environment in process memory, e.g. tool tracking data in the case of a robot manipulator. This state data needs to be recovered after a failure.

The system has a mechanism to supervise and restart the processes. This can be implemented at operating system, programming language or framework level, e.g. with the Service manager.

Problem.

How to implement a lightweight form of error handling that improves reliability and predictability?

Forces

  • Availability: The system as a whole should remain available, even if some subsystems or processes experience failures, since degraded functionality is better than no functionality. In case of a fault, only minimal part of the system should be affected. Recovery from failures should happen without human intervention and with minimal downtime.

  • Reliability: Generation of incorrect outputs should be prevented, otherwise errors may propagate and the system could cause damage to the environment.

  • Safety: If an error is detected, any functionality using the affected process should be stopped and taken to a safe state in order to prevent and minimize damages.

  • Cost-efficiency: Design diverse fault tolerance techniques are oversized or impractical for the application, but the system needs to be able to recover from errors.

  • Real-time performance: Control system needs to react within a certain time-limit; exceeding the time-limit causes a failure.

  • Predictability: The system should behave in a consistent manner. If the process tries to repair its corrupted state, behavior of the system cannot be predicted, which complicates debugging and verification of reliability. Predictability includes system behavior when a fault is triggered.

  • Recovery: Because it is impossible to foresee all possible faults, specifications do not cover all possible error situations. Various error situations occur seldom, are difficult to handle and non-trivial to simulate in testing [11]. If the programmers try to implement recovery, they will make ad hoc decisions not based on the specifications (i.e. they cannot know how the error should be handled), possibly causing unwanted and undocumented behavior.

Solution.

Make processes crash-safe and fast to recover; flush corrupted state by “crashing” the process instead of writing extensive error handling code.

The Commodore 64, DOS machines and other old computers were designed to be shut down by simply turning the power off, essentially crashing the system. On the other hand, if an operating system caches disk data in memory, a workstation crash may corrupt the file system, which is inconvenient and slow to repair. Control system processes and subsystems should likewise be designed to be easily terminated and recoverable with a simple recovery path if an error is detected, instead of guessing how error recovery should be attempted, possibly corrupting the program state further and causing unpredictable behavior.

Therefore, implement error handling by terminating the process that has encountered the error. Only program extended error recovery routines if they are based on the specification or it is self-evident how the error should be handled – otherwise crash the process. However, only the module or process where the error occurred should be crashed, not the whole system.

Processes that have been designed with Let it crash can (1) help to find faults by making them more visible (“offensive programming”), (2) prevent software degradation with Rejuvenation [11, 14], and (3) be used to implement fault tolerance (recovery from faults). In the last case it is possible to perform recovery without affecting service availability if the recovery process is fast enough. Recovery (and rejuvenation) needs an external entity to initiate the procedure, since the process itself has crashed (see Fig. 4). This pattern focuses mostly on the last case since it is more problematic to implement correctly.

Fig. 4. Process 1 encounters an error and dies, after which it is restarted by the service manager, represented as an eye. If process 2 detects a deadline overrun, it needs to stop, potentially interrupting process 3, and wait until process 1 is active again before resuming work. Alternatively, process 2 does not notice any deadline overruns and continues working normally.

You have a monitoring layer that can supervise and recover processes, e.g. by restarting them. For the monitoring layer to detect a failure, you may need to implement timeouts, or the faulty process must terminate upon encountering an error in order to send a signal to the monitoring layer (a parent process knows the liveliness state of its child processes). How the error is detected in the first place is not part of Let it crash, but contract programming or error checks could be used. Abnormal program termination can be forced e.g. by using abort() or raise(SIGSEGV). If the monitoring layer has implemented failure detection – based on a watchdog, heartbeat, etc. – it can also hard-fail the service using e.g. kill(pid, SIGTERM). This might be necessary if the process is incapable of detecting its own fault.
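The following sketch illustrates this division of labor under stated assumptions: expected error situations are still handled in place, whereas a violated invariant makes the process terminate abnormally so that the monitoring layer can restart it. The invariant, values and control law are hypothetical:

# Sketch of the Let it crash idiom at process level; all values are hypothetical.
import os
import random


def read_sensor() -> float:
    """Stand-in for a real sensor read; occasionally returns garbage."""
    return random.choice([0.8, 1.2, float("nan")])


def control_step(measurement: float) -> float:
    # Expected, specified error situations are handled in place ...
    if measurement != measurement:        # NaN is the only value not equal to itself
        return 0.0                        # specified handling: safe output
    return 0.5 * (1.0 - measurement)      # hypothetical control law


def main() -> None:
    for _ in range(1000):
        command = control_step(read_sensor())
        # ... but an "impossible" internal state is not repaired ad hoc: terminate
        # abnormally so the monitoring layer notices the crash (SIGABRT) and
        # restarts the process along the normal recovery path.
        if abs(command) > 100.0:          # invariant that should never be violated
            os.abort()


if __name__ == "__main__":
    main()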

Error recovery is performed by restarting the process. Therefore, make processes fast and easy to restart in order to minimize service failures and downtime. To keep the recovery path simple, use the single responsibility principle, thereby minimizing the responsibilities of a single process. If a process encounters an error and crashes, it might be possible to recover from the error without causing deadline misses for other processes and tripping the system into a Safe state. However, if a control loop has a period of e.g. 1 ms and restarting a process that provides information for the loop takes several milliseconds, the control loop execution will be interrupted.

Let it crash does not mean that error handling or exception handling should not be implemented at all. Indeed, sanity checks and error handling are essential for control systems and should be implemented to prepare for exceptional (but expected) circumstances, such as write operation failures or unavailable dependencies. Let it crash, on the other hand, is applicable in situations where the program experiences an unexpected failure and cannot reliably perform its function. This can happen due to programmer errors, complex interaction faults, intermittent faults, etc.

Recovery paths can be tested extensively by terminating the system forcibly every time it needs to be shut down or restarted, instead of letting it run through a normal shutdown process. This forces the system to do a recovery during the startup.

Make processes crash-safe. Processes typically handle three types of state data: dynamic, static, and internal. Internal state is related to current computations and is usually discarded after use. If a process crashes, you must consider whether you want to recycle its internal state: if you recycle everything, you risk hitting the exact same fault again and crashing, so it might be reasonable to recycle only parts of this state. Static state is configuration data that can be easily recovered or read from other processes. Finally, dynamic state data is generated as the program executes, by reading user inputs, interacting with other processes and the environment, etc. Some of it can be computed from other data or read directly from sensors, but the rest cannot be reconstructed. This data must be protected by using checkpointing, journaling or some other form of dedicated state store, for example databases or distributed data structures. To implement this, you must know What to save [6].
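A minimal checkpointing sketch for such irreplaceable dynamic state is shown below, assuming a simple file-based store; the file name and state contents are hypothetical:

# Checkpointing sketch for dynamic state that cannot be reconstructed after a crash.
import json
import os

CHECKPOINT = "tool_tracking.ckpt"         # hypothetical checkpoint file


def load_state() -> dict:
    """On (re)start, recover the last checkpointed dynamic state if it exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"attached_tool": None, "samples_processed": 0}    # fresh start


def save_state(state: dict) -> None:
    """Write atomically so that a crash mid-write never corrupts the checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT)           # atomic rename on POSIX


state = load_state()
state["samples_processed"] += 1           # dynamic state updated during operation
save_state(state)                         # checkpoint at suitable points

Writing to a temporary file and renaming it keeps the recovery path simple: after a crash, the restarted process finds either the previous complete checkpoint or a fresh default, never a partially written one.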

Implement a reporting functionality that reports failures so that they do not go unnoticed. Failure information can be forwarded e.g. by using a service manager or supervisors to send Notification messages [12].

The corollary to the Let it crash approach is that you must design your software to be ready for processes failing. There is now a possibility that a dependency is not available because it has crashed and is being restarted. To detect this situation, add timeouts or appropriate QoS policies to interactions between components. If a timeout is triggered, move the system to a Safe state. Normal operation can be resumed when the dependencies are back online. A missing dependency is therefore not considered to be an error that would necessitate a crash.
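A sketch of this behavior, with hypothetical timing values and stand-in functions, could look like the following:

# Tolerating a crashed dependency: stale data triggers Safe state, not a crash.
import time

DEADLINE_S = 0.05                  # hypothetical maximum age of position data
last_update = time.monotonic()     # refreshed by the subscriber callback


def on_position_sample(sample) -> None:
    """Called by the middleware whenever the dependency publishes new data."""
    global last_update
    last_update = time.monotonic()


def enter_safe_state() -> None:
    print("position data stale, holding safe state")    # stop motion only


def drive_actuators() -> None:
    pass                           # stand-in for the normal control output


def control_cycle() -> None:
    age = time.monotonic() - last_update
    if age > DEADLINE_S:
        enter_safe_state()         # the missing dependency is not treated as an error
        return                     # normal operation resumes once data flows again
    drive_actuators()


control_cycle()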

Consequences

+: Enables simple error handling and recovery; avoids complex error handling constructs in code, therefore improving the predictability of the system.

+: Cost-effective (lightweight) form of fault tolerance that does not require the use of redundancy.

+: Allows error handling to be implemented separately (externally) from the business logic, e.g. with supervisors.

+: Supports recovery from transient faults, since a restart is usually enough to handle them.

+: Possible to achieve high availability (for the system as a whole, not necessarily for all services provided by the system).

+: Complements other fault tolerant designs such as Redundancy and Rejuvenation.

+: Processes can be updated to new versions on-the-fly, since the old process can be killed and replaced using the normal recovery path.

+: Limits error propagation to other parts of the system (babbling idiot failure) by acting as an Error containment barrier [6].

+: Errors are less likely to cause the system to perform unpredictable and potentially dangerous or irreversible operations.

+: Finding faults should be easier, since they are made more visible by crashing and reporting.

−: Availability of some services provided by the system is lower (compared to redundant fault tolerance solutions) – on the other hand, the availability of other, unrelated services provided by the system should be unaffected.

−: Cannot mitigate persistent faults.

−: Processes need additional code to react to missing dependencies (i.e. other services, when waiting for them to come back online).

−: Possible performance cost if state needs to be saved to enable recovery.

−: Recovery speed is non-deterministic, since it depends on how fast the processes can be restarted, the loading of saved state, the loading of dependencies, the system load level, etc.

Known uses.

Erlang actor model and supervisors (Erlang is used e.g. in Ericsson AXD301 ATM switches) [2]: supervisors are processes that are responsible for starting, stopping and monitoring their child processes. The basic idea of a supervisor is that it should keep its child processes alive by restarting them when necessary [13].

Control system of Curiosity: Mars rovers are highly autonomous vehicles that operate in a high-radiation environment, relying on a low-bandwidth, high-latency communication link. A warm reset can be executed by the control system when it identifies a problem with one of its operations. On November 7, 2013, the Curiosity rover performed a reset of its control software upon encountering an unexpected event (an error in a catalog file) [17]. After the reset, the rover entered safe mode but was able to perform operations and communications as expected, and it successfully resumed nominal operations mode after the fault had been analyzed.

Related Patterns.

Error containment barrier, Notifications, Safe state, Service manager, Redundancy, What to save.

Minimize human intervention (MHI) is about how the system can process and resolve errors automatically before they become failures [6]. Let it crash could be implemented as part of MHI as a final resort, or in case there is no specification for error handling.

Software Rejuvenation is a proactive technique where the system has been designed to be rebooted periodically. Microrebooting [11] refers to a technique where suspect components are restarted before they fail.