1 Introduction: From Centralized to Distributed Fault Diagnosis

In systems and control engineering, the adoption of models describing the behavior of systems is ubiquitous and of fundamental importance. However, such models are usually affected by some uncertainty, and the sources of uncertainty vary widely. For instance, deriving an accurate mathematical model may be very difficult or entail increased financial costs, and so less accurate models are used. Other sources of uncertainty include measurement noise, system disturbances, and changes in system parameters due to component degradation over time. The presence of uncertainty is especially important when considering complex large-scale systems, such as Systems of Systems (SoS) [79] or Cyber-Physical Systems (CPS) [4], where it is difficult to understand and model the relationships that exist among the (possibly large) number of interconnected subsystems. Therefore, uncertainty represents an important challenge for many control applications, thus motivating the research and development of robust methods able to manage its presence and its effect on control performance [25, 67, 97, 109]. In some situations, the mismatch between the considered model and the actual system behavior becomes significant, due to the presence of undesired or unexpected behaviors, possibly leading to negative consequences such as instabilities, system failures, or deterioration of performance. Therefore, it is important to take modeling uncertainty into consideration at the design stage, so that if any unexpected behavior is observed during system operation, it is feasible to identify the presence of a fault while avoiding the occurrence of false alarms.

Reliability is a key requirement for modern systems. It can be defined as the ability of a system to perform its intended function over a given period of time [7]. The inability to perform the intended function is called a failure, and it can be due to the effects of a fault. A fault is a change in the behavior of a system, or part of it, from the behavior that was set at design time.

As practical systems become more complex and more interconnected, the need for enhanced robustness, fault tolerance and sustainability becomes essential. Potential faults could lead to major catastrophes and could consequently trigger a chain of failures in dependent systems, such as electric power systems, communication and water networks, and production plants, causing tremendous economic and social damage. Therefore, the safe and reliable operation of such systems, through the early detection of any “small” fault before it develops into a serious failure, is a crucial component of the overall system performance and sustainability.

For these reasons, fault diagnosis is a research field that has been in the forefront of the technological evolution for a few decades and has attracted the attention from the research and industrial communities, as testified by many important survey papers [33, 37, 43, 99,100,101] and books [9, 18, 44, 65].

Generally, fault diagnosis comprises several steps: the detection of a fault, the isolation and identification of the fault, and fault accommodation or reconfiguration of the system.

Fault detection consists of determining whether a fault has occurred or not, while the isolation task refers to pinpointing the type of fault and its location. Fault identification is an extra step that is carried out after isolation in order to quantify the extent to which a fault is present. Fault accommodation addresses the problem of how the system actively responds to the fault: for example, after a successful fault diagnosis, the controller parameters may be adjusted to accommodate the changed plant dynamics in order to prevent failure at the system level.

A control system mainly comprises three parts: the actuators, the plant components, and the sensors; therefore, a fault may appear in any of these (see Fig. 1). Specifically, process faults (on the plant components) alter the dynamics of the system, sensor faults alter the measurement readings and actuator faults modify the controllers’ influence on the system.

Apart from the fault source, we can further distinguish between abrupt and incipient faults. Abrupt faults are sudden, step-like changes that appear almost instantaneously and can lead to immediate component or even general system failure. On the other hand, incipient faults are slowly developing faults that occur due to parameter changes of the components caused by their continuous operation and diminishing lifetime. These changes develop slowly and are initially small, and are thus harder to detect; they may be better prevented through system maintenance.

Fig. 1 Fault types and FDI

There are mainly two methods to address the possible presence of a fault. The first one is physical redundancy (or hardware redundancy), in which critical components of the system are replicated in greater numbers than strictly necessary. This is effective but highly expensive, and can be justified only for critical, potentially life-threatening systems (e.g., aviation applications). The second method is the analytical redundancy approach, which is based on a mathematical model of the system under healthy behavior. In this approach, the actual measured physical signals are compared to the corresponding signals given by the mathematical model of the process in the healthy state; their difference constitutes the residuals (residual generation stage). Under the ideal conditions of no faults, no modeling uncertainties and no measurement noise or disturbances, the residuals are zero. In real applications, after the residual generation stage, the information given by the residuals is processed to take a decision regarding the health status of the system and determine the potential occurrence of faults (decision making stage). If the fault decision is positive, then further analysis is conducted to identify the fault’s type and location, and possibly its size. Although this approach is more affordable, it is computationally intensive and may be sensitive to false alarms due to inaccuracies in the mathematical modeling of the system, which may be mistaken for faults. This model-based approach was born during the 1970s thanks to the seminal works of Beard, Jones and Clark [5, 22, 47] among others (see the survey papers [33, 37, 45, 100]).

An alternative approach to model-based methods is represented by signal-based techniques, in which known features of signals, such as spectral components or statistical features, are compared to nominal ones [37, 44]. These methods, though, require some knowledge of the previous behavior of the system during healthy operation, and that is the reason why they are classified into the wider class of process history fault diagnosis approaches (see, e.g., [99] and the references therein).

Under the analytical redundancy framework, there are various methods to generate the residual vector, which can be divided into two main approaches: the state estimation techniques (such as the parity space approach, observer schemes, and detection filters) and the parameter identification techniques. Moreover, in order to ease the fault isolation task, residuals can be designed so as to contain specific isolation properties. The main residual enhancement techniques are represented by structured and directional residuals [38, 100]. In the structured residuals scheme, each fault affects a specific subset of the residuals and any residual responds only to a specific subset of faults. Therefore, due to the dependence of the residuals on the faults, certain patterns appear on the residual vector that can be used for fault isolation. In the directional residuals scheme, each fault corresponds to a specific direction in the residual space, and thus fault isolation is accomplished by selecting the direction to which the generated residual vector lies closest. More information regarding these techniques can be found in the books by Gertler [39] and Isermann [44]. In the literature, many methods have been proposed for the generation of residuals, which can mainly be classified according to the following approaches:

  • Parity space approach. This method consists of checking the consistency of the mathematical equations by using the actual measurements: a fault is declared whenever predetermined error thresholds are exceeded. Further information can be found in [38] and the references therein.

  • Observer schemes. In this category lie many approaches, starting from the Fault Detection Filter (FDF), first proposed by Beard and Jones in the early 1970s, to the Diagnostic Observer approach, which has been widely adopted in the literature. According to this approach, observers are used to reconstruct the output \(\hat{y}\) of the system from measurements y and the residual is represented by the output estimation error \(e=y-\hat{y}\) (a minimal numerical sketch of such an observer-based residual is given after this list). In the case of stochastic systems, the observers may be substituted by Kalman filters and the residual is the innovation, which in the fault-free case should be white noise with zero mean and known covariance. The isolation of faults can be enhanced with the use of a bank of residual generators under the Dedicated Observer Scheme (DOS) proposed by Clark [22] or the Generalized Observer Scheme (GOS) [33, 34]. In both schemes, as many residuals as the number of possible faults are generated. The difference is that in the DOS scheme, each residual is sensitive to only a single fault, while in the GOS, each residual is sensitive to all but one fault. The DOS scheme is appealing as it can also isolate concurrent faults, but it cannot always be designed. Instead, the GOS can always be applied, but can only isolate non-concurrent faults. It is important to note that, as pointed out in [34], the observers used in fault diagnosis are primarily output observers which simply reconstruct the measurable part of the state variables, rather than state observers which are required for control purposes. State observers for nonlinear systems have not been used extensively for the FDI problem, even though analytical results regarding the stability of nonlinear observers and design procedures have been established. The main issue with the observer approach is that the design of observers for nonlinear systems with asymptotically stable error dynamics is not an easy task even when the nonlinearities are fully known. As a result, the research in fault diagnosis for nonlinear systems utilizing state observers is more limited [1, 36, 41, 51].

  • Parameter estimation. This method is particularly suited to the detection of incipient faults and it is extensively studied in the survey papers by Isermann [45] and Frank [33] and the books by Patton et al. [65] and Isermann [44]. Using system identification methods (utilizing the input and output signals), the parameters of a mathematical model of the system can be obtained (recursively and online) across different time intervals and compared to their respective values based on a nominal model. Any significant difference could indicate the occurrence of a fault and a relation between parameter changes and faults can be formed with the use of pattern recognition methods.
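
As an illustration of the diagnostic observer idea mentioned above, the following minimal sketch simulates a small linear subsystem, a Luenberger-type output observer and the resulting residual. All numerical values (system matrices, observer gain, fault, noise) are illustrative assumptions and are not taken from this chapter.

```python
# Minimal sketch of an observer-based residual r = y - C x_hat for a linear system
# x_dot = A x + B u (+ fault),  y = C x + w.  All values below are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])
L = np.array([8.0, 15.0])                       # assumed observer gain (A - L C is Hurwitz)

u = lambda t: np.sin(t)                         # known control input
fault = lambda t: np.array([0.0, 0.5]) if t > 5.0 else np.zeros(2)   # abrupt fault at t = 5 s
noise = lambda t: 0.01 * np.sin(50.0 * t)       # deterministic stand-in for bounded sensor noise

ts = np.linspace(0.0, 10.0, 2001)
x = solve_ivp(lambda t, x: A @ x + B * u(t) + fault(t), (0.0, 10.0),
              [0.0, 0.0], t_eval=ts, max_step=0.01).y
y = lambda t: np.interp(t, ts, x[0]) + noise(t)                      # measured output

# Diagnostic observer: x_hat_dot = A x_hat + B u + L (y - C x_hat)
xh = solve_ivp(lambda t, xh: A @ xh + B * u(t) + L * (y(t) - C @ xh),
               (0.0, 10.0), [0.0, 0.0], t_eval=ts, max_step=0.01).y

r = np.abs(x[0] + noise(ts) - xh[0])            # residual |y - y_hat|
print("max |r| before the fault:", r[ts < 5.0].max(), " after:", r[ts > 6.0].max())
```

In a dedicated or generalized observer scheme, a bank of such observers would be run in parallel, each driven by a different subset of inputs and outputs, so that the pattern of residuals exceeding their thresholds points to the fault location.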

An important aspect to be considered when monitoring controlled systems relates to the possibly conflicting dynamic behaviors of the FDI scheme and the reconfigurable controller: the feedback controller may hide the presence of faults by compensating for their effects (see, as an example, the simulation analysis in [78]), thus making the FDI task much more difficult or even impossible [3, 21, 35, 100]. This is particularly evident in passive FDI methods, in which the health status of the system is analyzed by comparing input–output data of the closed-loop system with a process model or historical data. A possible solution to this problem has been proposed for applications in which the closed-loop dynamics can be affected by acting on the control inputs at run time. This paves the way to the so-called active FDI methodologies. Active FDI approaches consist of suitably modifying the control input to improve fault detectability and isolability [2, 6, 20, 42, 71, 73, 82, 87, 92]. The main limitation of active FDI techniques is their high computational cost and complexity, which considerably restricts the applicability of this approach to low-dimensional systems [30, 73, 85, 86, 104, 105], even though some approaches have been suggested in the literature to alleviate the computational complexity (see, as examples, [6, 62]).

An obvious problem in the practical implementation of model-based FDI schemes consists of deriving accurate mathematical models of engineering systems. This is a challenging task and thus, due to the presence of uncertainties and modeling errors, the resulting residual vectors are never identically zero. In addition, in the literature, the presence of measurement noise and modeling uncertainty is often overlooked. In most real-world applications, such uncertainties may significantly influence the performance of fault detection schemes by causing, for example, false alarms. Therefore, bounds on the residuals must be defined, but their proper choice remains a major problem. If the bounds are chosen too narrow, this may lead to false alarms, whereas if they are chosen too wide, faults may pass undetected. Therefore, dealing with uncertainty in Fault Detection and Isolation architectures is of fundamental importance. As a result, there is a growing demand for robust residual generation to reduce the sensitivity of the residual to modeling errors, noise and disturbances. This issue can be tackled either by the use of enhanced techniques for robust residual generation or by appropriately choosing the level of the error threshold, which can also change adaptively as discussed in the book by Patton et al. [65]. A line of research has tried to overcome the problem of accurate mathematical modeling by using qualitative models, where only qualitative information, such as the sign or trend of measured variables, is used [101], as well as classification techniques and inference methods. A more successful approach, however, is based on the use of adaptive online approximators, such as neural networks, to learn online the unknown or uncertain parts of the system dynamic model or the fault model [15, 16, 28, 31, 50, 53, 69, 98, 107].

1.1 Distributed and Networked Large-Scale Systems

In the literature, FDI methods have historically been designed for centralized frameworks, where information about the state of the system is gathered and processed centrally. From a practical perspective, gathering the distributed information into a central processing unit to implement a centralized approach for the fault diagnosis task is counterproductive due to communication overload and the requirement for higher computational power. Moreover, the processing of the information at a centralized station poses several risks, since the station constitutes a single point of failure, thus making the architecture possibly fragile to faults. Recent advances in communications and distributed sensing have allowed the transition from centralized fault diagnosis approaches [9, 18, 33, 65, 100] toward the development of hierarchical, decentralized and distributed schemes [8, 13,14,15, 23, 29, 31, 40, 48, 49, 52, 54,55,56, 66, 75, 76, 78, 84, 89, 90, 96, 102, 108].

In many cases, a distributed FDI framework is not an option but a necessity: factors such as the large-scale nature of the system to be monitored, its spatial distribution, and the inability to access certain parts of the system from a remote monitoring component all contribute to this formulation. Specifically, recent research efforts have focused on decentralized, distributed, networked systems, Cyber-Physical Systems (CPS) [4] and Systems of Systems (SoS) [80]. Examples of these systems include power networks, water distribution networks, transportation systems, smart buildings and complex industrial plants. The term CPS refers to systems with integrated computational and physical capabilities that can interact with humans through many new modalities [4], expanding the capabilities of the physical world through computation, communication, and control. On the other hand, an SoS can be considered as a composition of components that are themselves systems, characterized by two properties that the whole must possess [61]: operational and managerial independence of its components. This means that the component systems fulfill their own purposes and continue to operate to fulfill those purposes even if disassembled from the overall system; moreover, the component systems are managed (at least in part) for their own purposes rather than the purposes of the whole.

In this chapter, we will use the term networked with two meanings: the considered system can be represented as a network of physically interconnected subsystems, and the monitoring agents operate and collaborate using input–output information obtained through a communication network.

When monitoring these kinds of systems, distributed or decentralized algorithms are usually necessary due to computational, communication, scalability and reliability limits. The main benefits of using a distributed fault diagnosis scheme can be summarized as follows: (a) enhanced robustness of the monitoring architecture, since centralized approaches are subject to a single point of failure, (b) reduced computation costs, and (c) scalability benefits, since the distributed scheme allows for more flexibility in adding subsystems with respective fault detection modules, requiring fewer and possibly only local modifications to the already existing architecture. Moreover, an emerging requirement is the design of monitoring architectures that are robust to changes that may occur in the dynamic topology of large-scale systems, allowing the addition/disconnection of subsystems to/from the network of interconnected subsystems with only local operations (see for example [11, 13, 78]).

Concerning Cyber-Physical Systems, many contributions in the literature deal with the description of the technical challenges and the design and modeling issues that need to be addressed in order to interface with these modern systems, the technological impact deriving from CPS and the requirements emerging from them [4, 46, 57,58,59, 74, 83, 93, 103, 106]. With regard to the reliability, safety and security of CPS, some methods have been proposed [77], including some recent works dealing with the detection of cyber-physical attacks and attacks against process control systems [17, 19, 26, 63, 64, 81, 84, 88, 95]. An interesting approach for distributed fault diagnosis is based on exploiting sensor networks [32, 110].

Another important direction of research related to the control and monitoring of large-scale distributed networked systems is the design of distributed Fault-Tolerant Control (FTC) architectures based on passive [8, 10, 78, 91] or active FDI methods [72].

1.2 Outline of the Chapter

Motivated by the issues raised above, in this chapter, we present a distributed FDI architecture specifically designed for uncertain networked nonlinear large-scale systems. We will consider different sources of uncertainty, namely modeling uncertainty, measurement noise, and network-related uncertainties, such as communication delays, packet losses, and asynchronous measurements, and the presence of possibly unknown anomalies. In Sect. 2 the problem formulation is given and the objectives and contributions of this chapter are explained in detail. In Sect. 3, the development of a fault detection scheme is presented in a continuous-time framework based on [48], where a filtering technique, which is embedded in the design of the residual and threshold signals, is used to attenuate the measurement noise. This allows for the design of tight thresholds, and thus enhances fault detectability whilst guaranteeing the absence of false alarms. This filtering approach for fault detection is rigorously investigated, providing results regarding the class of detectable faults, the magnitude of detectable faults and the filtering impact (according to the poles’ location and filters’ order) on the detection time.

Section 4 addresses the need for integration between the different levels composing a CPS, which are deeply intertwined in modern systems, by presenting a comprehensive architecture, based on [14], where all the parts of complex distributed systems are considered: the physical environment, the sensor level, the diagnosers layer, and the communication networks. Based on the problem formulation given in Sect. 2 and on the filtering approach explained in Sect. 3, a distributed fault diagnosis approach is designed for distributed uncertain nonlinear large-scale systems to specifically address the issues emerging when considering networked diagnosis systems, such as the presence of delays and packet dropouts in the communication networks, which degrade performance and can be a source of instability, misdetection, and false alarms.

Section 5 discusses some further issues in fault diagnosis, namely the actions taken after the detection of a fault in order to identify its location and magnitude, or even to learn the fault function so that it can be used in fault accommodation schemes. Finally, in Sect. 6, some concluding remarks are given.

2 Problem Formulation

Consider a large-scale distributed nonlinear dynamic system composed of N subsystems \(\varSigma _I\), \(I \in \{1,...,N\}\), each of which is described by the following state and measurement equations:

$$\begin{aligned} \dot{x}_I(t)&= f_I\big (x_I(t),u_I(t)\big )+ g_I\big (x_I(t),z_I(t),u_I(t)\big ) + \eta _I\big (x_I(t),z_I(t),u_I(t)\big ) \nonumber \\&\quad + \beta _I(t-T_0)\phi _I\big (x_I(t),z_I(t),u_I(t)\big ) \end{aligned}$$
(1)
$$\begin{aligned} m_I(t)&=x_I(t)+w_I(t) \end{aligned}$$
(2)

where \(x_I \in \mathbb {R}^{n_I}\), \(u_I \in \mathbb {R}^{l_I}\) and \(m_I \in \mathbb {R}^{n_I}\) are the state, input and measured output vectors of the I-th subsystem respectively, \(z_I \in \mathbb {R}^{\bar{n}_I}\) is the vector of interconnection variables, i.e., the state variables of the other subsystems \(J \in \{1,\ldots ,N \} \setminus \{ I \}\) that affect the I-th subsystem, \(f_I:\mathbb {R}^{n_I} \times \mathbb {R}^{l_I} \mapsto \mathbb {R}^{n_I}\) is the known local dynamics function of the I-th subsystem and \(g_I:\mathbb {R}^{n_I} \times \mathbb {R}^{\bar{n}_I} \times \mathbb {R}^{l_I} \mapsto \mathbb {R}^{n_I}\) is the known part of the interconnection function between the I-th and the other subsystems. The vector function \(\eta _I:\mathbb {R}^{n_I} \times \mathbb {R}^{\bar{n}_I}\times \mathbb {R}^{l_I} \mapsto \mathbb {R}^{n_I}\) is the overall modeling uncertainty associated with the known local and interconnection dynamics and \(w_I \in \mathscr {D}_{w_I} \subset \mathbb {R}^{n_I}\) (\(\mathscr {D}_{w_I}\) is a compact set) represents the measurement noise. The state vectors \(x_I\), \(I \in \{1,...,N\}\), are considered unknown, whereas their noisy counterparts \(m_I\) are known. Analogously, in the case of the interconnection variable \(z_I\), only its noisy counterpart \(m_{zI}(t)=z_I(t)+\varsigma _I(t)\) is available, where \(\varsigma _I(t)\) is composed of the components of \(w_J\) affecting the relevant components of \(m_J\) (as before, J refers to a neighboring subsystem). The term \(\beta _I(t-T_0)\phi _I(x_I,z_I,u_I)\) characterizes the fault affecting the I-th subsystem, including its time evolution. More specifically, \(\phi _I:\mathbb {R}^{n_I} \times \mathbb {R}^{\bar{n}_I} \times \mathbb {R}^{l_I} \mapsto \mathbb {R}^{n_I}\) is the unknown fault function and \(\beta _I:\mathbb {R} \mapsto \mathbb {R}^+\) denotes the time evolution of the fault, where \(T_0\) is the unknown time of the fault occurrence [70]. Note that the fault function \(\phi _I\) may depend on the interconnection state variable vector \(z_I\) and not only on the local state vector \(x_I\). In this work, we consider the case of a single fault that occurs in a subsystem (hence there is only one function \(\phi _I(\cdot )\)) and not the case of a distributed fault that spans several subsystems. Of course, the fault that occurs in a subsystem \(\varSigma _I\) can affect neighboring subsystems \(\varSigma _J\) through the interconnection terms \(z_J\). The fault time profile \(\beta _I(t-T_0)\) can be used to model abrupt or incipient faults using a decaying exponential type function:

$$\begin{aligned} \beta _I(t-T_0)\triangleq {\left\{ \begin{array}{ll} 0 &{} \text {if } t<T_0 \\ 1-e^{-b_I(t-T_0)} &{} \text {if } t \ge T_0 \end{array}\right. } \end{aligned}$$
(3)

where \(b_I>0\) is typically an unknown parameter which denotes the fault evolution rate. Abrupt faults correspond to the limit \(b_I \rightarrow \infty \), in this case, the time profile \(\beta _I(t-T_0)\) becomes a step function. In general, small values of \(b_I\) indicate slowly developing faults (incipient faults), whereas large values of \(b_I\) make the time profile \(\beta _I(t-T_0)\) approach a step function (abrupt faults).

In this work, subsystem \(\varSigma _J\) is said to affect subsystem \(\varSigma _I\) (or, in other words, \(\varSigma _J\) is a “neighbor” of \(\varSigma _I\)), if the vector of interconnection variables of \(\varSigma _I\), i.e., \(z_I(t)\), contains at least one of the state variables of \(\varSigma _J\), i.e., \(x_J(t)\).

The notation \(| \cdot |\) used in this chapter indicates the absolute value of a scalar function or the 2-norm in case of a vector. In addition, the notation \(y(t)=H(s) \big [ x(t) \big ]\) (which is used extensively in the adaptive control literature) denotes the output y(t) of a linear system represented by the transfer function H(s) with x(t) as input. In terms of more rigorous notation, let h(t) be the impulse response associated with H(s); i.e., \(h(t) \triangleq \mathscr {L}^{-1}\left[ H(s)\right] \), where \(\mathscr {L}^{-1}\) is the inverse Laplace transform operator. Then \(y(t)=H(s) \big [ x(t) \big ] = \int _0^t{h(\tau ) x(t-\tau ) } \,\mathrm {d}\tau \).
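
To make the operator notation concrete, the following short sketch computes \(y(t)=H(s)\big [ x(t) \big ]\) numerically, both with a linear simulation routine and via the convolution integral above; the first-order filter and the input signal are illustrative assumptions, not taken from the chapter.

```python
# Sketch of the operator notation y(t) = H(s)[x(t)]: the output of the linear filter
# H(s) driven by x(t).  Filter and input below are illustrative assumptions.
import numpy as np
from scipy import signal

H = signal.TransferFunction([2.0], [1.0, 2.0])      # H(s) = 2/(s + 2)
t = np.linspace(0.0, 5.0, 2001)
x = np.sin(3.0 * t)                                  # input signal x(t)

_, y, _ = signal.lsim(H, U=x, T=t)                   # y(t) = H(s)[x(t)], zero initial conditions

# Equivalently, y(t) is the convolution of the impulse response h(t) with x(t):
_, h = signal.impulse(H, T=t)
y_conv = np.convolve(h, x)[: t.size] * (t[1] - t[0])
```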

The following assumptions are used throughout the chapter:

Assumption 1

For each subsystem \(\varSigma _I\), \(I \in \{1,...,N\}\), the local state variables \(x_I(t)\) and the local inputs \(u_I(t)\) belong to a known compact region \(\mathscr {D}_{x_I}\) and \(\mathscr {D}_{u_I}\), respectively, before and after the occurrence of a fault, i.e., \(x_I(t) \in \mathscr {D}_{x_I}\), \(u_I(t) \in \mathscr {D}_{u_I}\) for all \(t \ge 0\). \({} \square \)

Assumption 2

The modeling uncertainty \(\eta _I^{(i)}\) (i denotes the i-th component of \(\eta _I\)) in each subsystem is an unstructured and possibly unknown nonlinear function of \(x_I\), \(z_I\), and \(u_I\) but uniformly bounded by a known positive function \(\bar{\eta }_I^{(i)}\), i.e.,

$$\begin{aligned} | \eta _I^{(i)}(x_I,z_I,u_I) | \le \bar{\eta }_I^{(i)}(m_I,m_{zI},u_I) , \quad i=1,2,\ldots ,n_I \end{aligned}$$
(4)

for all \(t \ge 0\) and for all \((x_I,z_I,u_I) \in \mathscr {D}_I\), where \(m_{zI} = z_I + \varsigma _I\) is the measurable noisy counterpart of \(z_I\), \(\varsigma _I \in \mathscr {D}_{\varsigma _I} \subset \mathbb {R}^{\bar{n}_I}\) and \(\bar{\eta }_I^{(i)}(m_I,m_{zI},u_I) \ge 0\) is a known bounding function in some region of interest \(\mathscr {D}_I = \mathscr {D}_{x_I} \times \mathscr {D}_{z_I} \times \mathscr {D}_{u_I} \subset \mathbb {R}^{n_I} \times \mathbb {R}^{\bar{n}_I} \times \mathbb {R}^{l_I}\). The regions \(\mathscr {D}_{\varsigma _I}\) and \(\mathscr {D}_I\) are compact sets. \({} \square \)

Assumption 1 is required for well posedness since here we do not address the control design and fault accommodation. Assumption 2 characterizes the class of modeling uncertainties being considered. In practice, the system can be modeled more accurately in certain regions of the state space. Therefore, the fact that the bound \( \bar{\eta }_I\) is a function of \(m_I\), \(m_{zI}\) and \(u_I\) provides more flexibility by allowing the designer to take into consideration any prior knowledge of the system. Moreover, the bound \(\bar{\eta }_I\) is required in order to distinguish the effects between modeling uncertainty and faults. For example if the bound \(\bar{\eta }_I\) is not set properly and it is too low so that (4) does not hold, then false alarms may occur. On the other hand, if the bound \(\bar{\eta }_I\) is set too high, so that (4) holds, then this might lead to conservative detection thresholds which may never be crossed, leading to undetected faults. Therefore, the handling of the modeling uncertainty is a key design issue in fault diagnosis architectures, which creates a trade-off between false alarms and conservative fault detection. In Sect. 4.4, adaptive approximation methods will be used to learn the modeling uncertainty \(\eta _I\) and we will use the learned function in order to obtain even tighter detection thresholds and enhance fault detectability.

Fig. 2 An example of the proposed multi-layer fault detection architecture. The local state variables for each subsystem (physical layer, left) are measured by the sensor layer (center). The sensors communicate their measurements to the LFDs by means of the first level communication network. The second level communication network (right) allows the diagnosers to communicate with each other exchanging information

Each sensor is associated with exactly one subsystem (see Fig. 2). The local sensor \(S_I^{(i)}\) associated with the I-th subsystem provides a measurement \(m_I^{(i)}\) of the i-th component of the local state vector \(x_I\) according to the output equation

$$\begin{aligned} S_I^{(i)} \, :\, m_I^{(i)}(t)= x_I^{(i)}(t)+w_I^{(i)}(t) \, , \quad i=1,\ldots ,n_I \, , \end{aligned}$$
(5)

where \(w_I^{(i)}\) denotes the noise affecting the i-th sensor of the I-th subsystem.

Assumption 3

For each i-th measurement \(m_I^{(i)}\), with \(i=1,\dots ,n_I\), being the vector component index, the measurement uncertainty term \(w_I^{(i)}\) is an unstructured and unknown function of time, but it is bounded by a known positive time function \( \bar{w}_I^{(i)}(t)\) such that \(\left| w_I^{(i)}(t)\right| \le \bar{w}_I^{(i)}(t)\), \(i=1,\dots ,n_I\), \(I=1, \ldots ,N\), \(t \ge 0\). \({} \square \)

We assume that the control input is available without any error or delay (it is assumed that there exist feedback controllers yielding a local control action \(u_I\) such that some desired control objectives are achieved). Each subsystem is monitored by its respective Local Fault Diagnoser (LFD). The objective is to design and analyze a distributed fault detection scheme, with each subsystem \(\varSigma _I\) being monitored by an LFD that receives local measurements through the first communication network (see Fig. 2) and partial information (i.e., the measurements \(m_{zI}\) of the interconnection variables) from neighboring LFDs through the second communication network. In general, the distributed fault detection scheme is composed of N LFDs \(\mathscr {S}_I\), one for each subsystem \(\varSigma _I\). Each LFD \(\mathscr {S}_I\) requires the input and output measurements of the subsystem \(\varSigma _I\) that it is monitoring and also the measurements of all interconnecting subsystems \(\varSigma _J\) that affect \(\varSigma _I\). Note that these last measurements are communicated by neighboring LFDs \(\mathscr {S}_J\), and not by the subsystems \(\varSigma _J\); hence, the LFDs need to communicate with each other according to their interconnections, and the second-layer communication network mirrors the physical coupling morphology. Note that the information exchanged among the subsystems is readily available, since it consists of quantities \(z_I\) that are measurable with some uncertainty as \(m_{zI}(t)=z_I(t)+\varsigma _I(t)\) (the noisy counterpart of \(z_I\)). The distributed nature of the scheme therefore stems from this communication between the LFDs, determined by their interconnections. More specifically, each LFD receives from its local sensors the noisy state measurements forming the vector \(m_I=\text {col}(m_I^{(i)}, i=1,\dots ,n_I)\) (see (5)) and from the J-th neighboring LFD the noisy measurements \(m_{zI}^{(i)}, \, i=1,\ldots ,\bar{n}_I\) of the local state variable components \(x_J^{(i)}\) that influence the I-th subsystem (i.e., the variables \(x_J^{(i)}\) belonging to the interconnection vector \(z_I\)). Each LFD computes a local state estimate \(\hat{x}_{I}(t)\) based on the local I-th model, by communicating the interconnection variables (and possibly other information) to neighboring LFDs. The LFD implements a model-based fault detection method: the local residual error vector \(r_I(t)\) is compared, component by component, to a time-varying detection threshold vector \(\bar{r}_I(t)\), suitably designed to guarantee the absence of false alarms.

2.1 Objectives and Contributions

In this chapter, a distributed fault diagnosis methodology is presented to address the sources of uncertainty mentioned in the introduction. More specifically:

  (a) a filtering-based design is embedded in a distributed fault diagnosis methodology to dampen the effect of the measurement noise and enhance fault detection robustness by facilitating less conservative conditions for fault detectability;

  (b) an adaptive learning approach is adopted to reduce the modeling uncertainty and thus further enhance fault detectability;

  (c) a delay compensation strategy is devised to address delays and packet losses in the communication network between the LFDs, using time stamps and a buffer, called the diagnosis buffer (see Fig. 4);

  (d) a model-based re-synchronization algorithm is embedded in the diagnosis procedure to manage asynchronous measurements. This algorithm is based on virtual sensors implemented in the LFDs and on the use of a measurements buffer (see Fig. 4).

In the following, we will first present in Sect. 3 the distributed filtering approach in a continuous-time framework under the assumptions of (i) global synchronization, i.e., subsystems, sensors, and LFDs are assumed to share the same clock and sampling frequency, and (ii) perfect information exchange, i.e., it is assumed that the information exchanged between LFDs and communicated from the system to the LFDs is delivered without any error or delay and is immediately available at any point of the diagnosis system. The effect of the filtering on the detectability performance is rigorously analyzed. After that, in Sect. 4, the filtering design is adapted to a discrete-time formulation to allow the analysis of more realistic networked scenarios, where different strategies for managing modeling uncertainty and network-related issues are integrated in a comprehensive framework.

3 Filtering-Based Distributed Fault Detection

In this section, we present a filtering framework for the detection of faults in a class of interconnected, nonlinear, continuous-time systems with modeling uncertainty and measurement noise (see [48] for more details). In order to address the measurement noise issue, which can lead to conservative detection thresholds or even false alarms if not dealt with properly, filtering is used by embedding the filters into the design in a way that takes advantage of their noise suppression properties. Essentially, filtering dampens the effect of measurement noise in a certain frequency range, making it possible to set less conservative adaptive fault detection thresholds and thus enhancing fault detectability. As a result, a robust fault detection scheme is designed which guarantees no false alarms. The distributed fault detection scheme comprises a set of interacting LFDs, in which each subsystem is monitored by its respective detection agent.

To dampen the effect of measurement uncertainty \(w_I(t)\), each measured variable \(m_I^{(i)}\) is filtered by H(s), where H(s) is a p-th order filter with strictly proper transfer function

$$\begin{aligned}&H(s)=sH_p(s), \end{aligned}$$
(6)
$$\begin{aligned}&H_p(s)=\frac{d_{p-2}s^{p-2} +d_{p-3}s^{p-3} + \ldots +d_0}{s^{p}+c_{p-1}s^{p-1}+\ldots +c_1s+c_0}. \end{aligned}$$
(7)

Note that the strictly proper requirement is important: if the transfer function H(s) were only proper (and not strictly proper), the noise would appear directly in the filter output and the noise dampening would not be effective.

The choice of the particular type of filter to be used is application dependent, and it is made according to the available a priori knowledge of the noise properties. Usually, measurement noise consists of high-frequency components, and therefore the use of a low-pass filter for dampening the noise is well justified. On other occasions, one may want to focus the fault detectability on a prescribed frequency band of the measurement signals and hence choose the filter accordingly.

Generally, each measured variable \(m_I^{(i)}(t)\) can be filtered by a different filter. In this chapter, without loss of generality, we consider H(s) to be the same for all the output variables in order to simplify the notation and presentation.

The filters H(s) and \(H_p(s)\) are asymptotically stable and hence BIBO stable. Therefore, for bounded measurement noise \(w_I(t)\) (see Assumption 3), the filtered measurement noise \(\varepsilon _{w_I}(t) \triangleq H(s)\left[ w_I(t)\right] \) is uniformly bounded as follows:

$$\begin{aligned} | \varepsilon _{w_I}^{(i)}(t) | \le \bar{\varepsilon }_{w_I}^{(i)} \quad i=1,2,\ldots ,n_I, \end{aligned}$$
(8)

where \(\bar{\varepsilon }_{w_I}^{(i)}\) are known bounding constants. Depending on the noise characteristics, H(s) can be selected to reduce the bound \(\bar{\varepsilon }_{w_I}^{(i)}\).
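
As a rough illustration of how such bounding constants could be selected in practice, the following sketch estimates \(\bar{\varepsilon }_{w_I}^{(i)}\) by Monte Carlo filtering of admissible noise realizations through H(s); the uniform noise model and the filter parameters are illustrative assumptions, not prescriptions from the chapter.

```python
# Minimal sketch: estimating the filtered-noise bound in (8) by filtering bounded
# noise samples through H(s) = s * Hp(s).  Noise model and filter are assumptions.
import numpy as np
from scipy import signal

alpha, p, w_bar = 10.0, 2, 0.05
den = (np.poly1d([1.0, alpha]) ** p).coeffs          # (s + alpha)^p
H = signal.TransferFunction([alpha ** p, 0.0], den)  # H(s) = s * alpha^p/(s + alpha)^p

t = np.linspace(0.0, 10.0, 4001)
rng = np.random.default_rng(0)
eps_w_bar = 0.0
for _ in range(200):                                 # admissible bounded-noise realizations
    w = rng.uniform(-w_bar, w_bar, size=t.size)
    _, eps_w, _ = signal.lsim(H, U=w, T=t)
    eps_w_bar = max(eps_w_bar, np.max(np.abs(eps_w)))
print("estimated bound on the filtered noise:", eps_w_bar)
```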

3.1 Distributed Fault Detection

In this section, we explain in detail the fault filtering framework in order to obtain the residual signals \(r_I(t)\) to be used for fault detection and the corresponding detection thresholds \(\bar{r}_I(t)\). The fault detection logic is based on deriving suitable detection thresholds so that in the absence of a fault the residual signals are bounded by their corresponding detection threshold signals, guaranteeing no false alarms. To state this formally: in the absence of a fault (i.e., for \(t \in [0,T_0)\)), it is guaranteed that \(| r_I^{(i)}(t) | \le \bar{r}_I^{(i)}(t)\), \(\forall i=1,\ldots ,n_I\) and \(\forall I=1,\ldots ,N\). The detection decision of a fault in the overall system is made when \(| r_I^{(i)}(t) | > \bar{r}_I^{(i)}(t)\) at some time t for at least one component i in any subsystem \(\varSigma _I\). Note that, in this chapter, only a single fault \(\phi _I\) is considered to occur in the large-scale distributed system.

By locally filtering the output signal \(m_I(t)\), we obtain the filtered output \(y_{I,f}(t)\):

$$\begin{aligned} y_{I,f}(t)&=H(s) \left[ m_{I}(t) \right] \\&=H(s) \left[ x_{I}(t) + w_{I}(t) \right] \nonumber . \end{aligned}$$
(9)

By using \(\varepsilon _{w_I}(t) = H(s)\left[ w_I(t)\right] \) and the fact that \(s[x_I(t)]=\dot{x}_I(t)+x_I(0) \delta (t)\) (where \(\delta (t)\) is the delta function), we obtain

$$\begin{aligned} y_{I,f}(t)&= H(s)\left[ x_{I}(t) \right] +\varepsilon _{w_I}(t) \nonumber \\&=H_p(s) \left[ \dot{x}_{I}(t) \right] + H_p(s) \left[ x_I(0) \delta (t) \right] + \varepsilon _{w_I}(t) \nonumber \\&=H_p(s) \big [ f_I\big (x_I(t),u_I(t))+ g_I(x_I(t),z_I(t),u_I(t)\big ) \nonumber \\&\quad \qquad \quad + \eta _I\big (x_I(t),z_I(t),u_I(t)\big ) + \beta _I(t-T_0)\phi _I\big (x_I(t),z_I(t),u_I(t)\big )\big ] \nonumber \\&\quad \quad \quad \,\,\,\,\, +\varepsilon _{w_I}(t) + h_p(t) x_I(0) , \end{aligned}$$
(10)

where \(h_p(t)\) is the impulse response of the filter \(H_p(s)\), i.e., \(h_p(t) \triangleq \mathscr {L}^{-1}\left[ H_p(s) \right] \). The estimation model \(\hat{x}_{I}(t)\) for \(x_I(t)\) under fault-free operation is generated based on (1) by considering only the known components and by using the measurements \(m_I\) and \(m_{zI}\) as follows:

$$\begin{aligned} \dot{\hat{x}}_{I} = f_I(m_I(t),u_I(t))+ g_I(m_I(t),m_{zI}(t),u_I(t)), \end{aligned}$$
(11)

with the initial condition \(\hat{x}_I(0)=m_I(0)\).

The corresponding estimation model for \(y_{I,f}(t)\), denoted by \(\hat{y}_{I,f}(t)\), is given by

$$\begin{aligned} \hat{y}_{I,f}(t)=H(s) \big [ \hat{x}_{I}(t) \big ], \end{aligned}$$
(12)

and by using (11) and following a similar procedure as in the derivation of (10), \(\hat{y}_{I,f}(t)\) becomes

$$\begin{aligned} \hat{y}_{I,f}(t)=&H_p(s) \big [ f_I\big (m_I(t),u_I(t)\big )+ g_I\big (m_I(t),m_{zI}(t),u_I(t)\big ) \big ] +h_p(t) m_I(0). \end{aligned}$$
(13)

The local residual error \(r_I(t)\) to be used for fault detection is defined as

$$\begin{aligned} r_I(t) \triangleq y_{I,f}(t) - \hat{y}_{I,f}(t), \end{aligned}$$
(14)

and it is readily computable from Eqs. (9), (11) and (12).

Prior to the fault (\(t<T_0\)), the local residual error can be written using Eqs. (10), (13) and (14) as

$$\begin{aligned} r_I(t) = H_p(s) \left[ \chi _I(t) \right] +\varepsilon _{w_I}(t) \end{aligned}$$
(15)

where the total uncertainty term \(\chi _I(t)\) is defined as

$$\begin{aligned}&\chi _I(t) \triangleq \varDelta f_I(t) + \varDelta g_I(t) + \eta _I\big (x_I(t),z_I(t),u_I(t)\big ), \end{aligned}$$
(16)
$$\begin{aligned}&\varDelta f_I(t) \triangleq f_I\big ( x_I(t),u_I(t)\big ) - f_I \big ( x_I(t)+w_I(t),u_I(t) \big ), \end{aligned}$$
(17)
$$\begin{aligned}&\varDelta g_I(t) \triangleq g_I\big (x_I(t),z_I(t),u_I(t) \big ) - g_I \big ( x_I(t)+w_I(t),z_I(t)+\varsigma _I(t),u_I(t) \big ). \end{aligned}$$
(18)

For simplicity, in the derivation of (15), the initial conditions \(x_I(0)=m_I(0)\) are assumed to be known. If there is uncertainty in the initial conditions (i.e., \(x_I(0) \ne m_I(0)\)) then that introduces the extra term \(h_p(t)(x_I(0)-m_I(0))\) in (15) which however converges to zero exponentially (since \(h_p(t)\) is exponentially decaying [24]) and thus does not affect significantly the subsequent analysis.

By taking bounds on (15) and by using the triangle inequality for each component i of the residual, we obtain

$$\begin{aligned} | r_I^{(i)}(t) |&\le | H_p(s) \left[ \chi _I^{(i)}(t) \right] | + | \varepsilon _{w_I}^{(i)}(t) | = | \int _{0}^t h_p (t-\tau ) \chi _I^{(i)}(\tau ) \,\mathrm {d}\tau | + | \varepsilon _{w_I}^{(i)}(t) | \nonumber \\&\le \int _{0}^t | h_p (t-\tau ) | | \chi _I^{(i)}(\tau ) | \,\mathrm {d}\tau + | \varepsilon _{w_I}^{(i)}(t) | \nonumber \\&\le \int _{0}^t \bar{h}_p (t-\tau ) \bar{\chi }_I^{(i)}(\tau ) \,\mathrm {d}\tau +\bar{\varepsilon }_{w_I}^{(i)} \end{aligned}$$
(19)

where \(\bar{h}_p(t)\) is the impulse response (of the filter \(\bar{H}_p(s)\)) that satisfies \(| h_p(t) | \le \bar{h}_p(t)\) for all \(t>0\) (details for selecting \(\bar{H}_p(s)\) will be given in Sect. 3.2) and \(\bar{\chi }_I^{(i)}(t) \) is the bound on the total uncertainty term \(\chi _I^{(i)}(t) \), i.e., \( | \chi _I^{(i)}(t) | \le \bar{\chi }_I^{(i)}(t)\).

Using Assumption 2, the bound \(\bar{\chi }_I^{(i)}(t), \,\, i=1,2,\ldots ,n_I\) is defined as

$$\begin{aligned} \begin{aligned} \bar{\chi }_I^{(i)}(t) \triangleq&\overline{\varDelta f}_I^{(i)} + \overline{\varDelta g}_I^{(i)} + \bar{\eta }_I^{(i)}\big (m_I(t),m_{zI}(t),u_I(t)\big ), \end{aligned} \end{aligned}$$
(20)

where

$$\begin{aligned} \overline{\varDelta f}_I^{(i)}&\triangleq \sup _{ \begin{array}{c} (x_I,u_I) \in \mathscr {D}_{x_I} \times \mathscr {D}_{u_I} \\ w_I \in \mathscr {D}_{w_I} \end{array}} | f_I^{(i)}\big ( x_I,u_I\big ) - f_I^{(i)} \big ( x_I+w_I,u_I \big ) | \end{aligned}$$
(21)
$$\begin{aligned} \overline{\varDelta g}_I^{(i)}&\triangleq \sup _{\begin{array}{c} (x_I,z_I,u_I) \in \mathscr {D}_{I} \\ (w_I,\varsigma _I) \in \mathscr {D}_{w_I} \times \mathscr {D}_{\varsigma _I} \end{array}} | g_I^{(i)}\big (x_I,z_I,u_I \big ) - g_I^{(i)} \big ( x_I+w_I,z_I+\varsigma _I,u_I \big ) | . \end{aligned}$$
(22)

Since the regions \(\mathscr {D}_I\), \(\mathscr {D}_{w_I}\) and \(\mathscr {D}_{\varsigma _I}\) are compact sets, the suprema in (21) and (22) are finite. In addition, note that the bound \(\bar{\chi }_I^{(i)}(t)\) in (20) depends on t because of the bounding function \(\bar{\eta }_I^{(i)}\).

Finally, a suitable detection threshold \(\bar{r}_I^{(i)}(t)\) can be selected as the right-hand side of (19) which can be rewritten as

$$\begin{aligned} \bar{r}_I^{(i)}(t) = \bar{H}_p(s) \left[ \bar{\chi }_I^{(i)}(t) \right] +\bar{\varepsilon }_{w_I}^{(i)}. \end{aligned}$$
(23)

A practical issue that requires consideration is the derivation of the bound \(\bar{\chi }_I^{(i)}(t) \) given in (20). Specifically, the derivation of \(\bar{\chi }_I^{(i)}(t) \) requires the bounds \(\overline{\varDelta f}_I^{(i)}\) and \(\overline{\varDelta g}_I^{(i)}\) on \(\varDelta f_I^{(i)}(t)\) and \(\varDelta g_I^{(i)}(t)\), respectively. One approach for deriving the bound \( \overline{\varDelta f}_I^{(i)}\) in (21) is to consider a local Lipschitz assumption:

$$\begin{aligned} | f_I^{(i)}(x_I,u_I) - f_I^{(i)}(x_I+w_I,u_I) | \le L_{f_I^{(i)}} | w_I | \end{aligned}$$
(24)

where \(L_{f_I^{(i)}}\) is the Lipschitz constant for the function \(f_I^{(i)}(x_I,u_I)\) with respect to \(x_I\) in the region \( \mathscr {D}_{x_I}\). Therefore, if we have a bound \(w_{I}^M\) on the measurement noise, i.e., \(| w_I(t) | \le w_{I}^M\) for all \(t>0\), then we directly obtain the bound \(\overline{\varDelta f}_I^{(i)} \le L_{f_I^{(i)}} w_I^M\) on \(\varDelta f_I^{(i)}(t)\). A similar approach can be followed for \(\varDelta g_I^{(i)}(t)\).

Another way of obtaining a less conservative bound than \(\bar{\chi }_I^{(i)}\), and therefore further enhancing fault detectability, is to exploit the filtering, which can prove beneficial for dampening the mismatch term \( \varDelta f_I(t) + \varDelta g_I(t)\) that results from the measurement noise. Among the various filters one can select, some may lead to less conservative detection thresholds. Therefore, a significantly less conservative detection threshold, without the need for the Lipschitz constants, can be obtained by observing that the residual (15) can be written as

$$\begin{aligned} r_I(t) =&H_p(s) \left[ \eta _I\big (x_I(t),z_I(t),u_I(t)\big ) \right] +H_p(s) \left[ \varDelta f_I(t) + \varDelta g_I(t) \right] +\varepsilon _{w_I}(t) \end{aligned}$$
(25)

and by making the following assumption:

Assumption 4

The filtered function mismatch term \(\varepsilon _{\varDelta _I}(t) \triangleq H_p(s) \left[ \varDelta f_I(t) +\right. \) \(\left. \varDelta g_I(t) \right] \) is uniformly bounded as follows:

$$\begin{aligned} | \varepsilon _{\varDelta _I}^{(i)}(t) | \le \bar{\varepsilon }_{\varDelta _I}^{(i)} \quad i=1,2,\ldots ,n_I, \end{aligned}$$
(26)

where \(\bar{\varepsilon }_{\varDelta _I}^{(i)}\) is a known bounding constant. \({} \square \)

Assumption 4 is based on the fact that filtering dampens the effect of measurement noise present in the function mismatch term \(\varDelta f_I(t) + \varDelta g_I(t)\). A suitable selection of \(\bar{\varepsilon }_{\varDelta _I}^{(i)}\) can be made through the use of simulations (e.g., Monte Carlo methods) by filtering the function mismatch term \(\varDelta f_I(t)+\varDelta g_I(t)\) using the known function dynamics and the available noise characteristics (recall that the measurement noise is assumed to take values in a compact set).
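
A possible numerical realization of this Monte Carlo selection, for a scalar example with assumed (illustrative) local dynamics, interconnection term, noise bounds, operating trajectory and filter, is sketched below.

```python
# Minimal sketch: Monte Carlo estimate of the bound on the filtered function
# mismatch Hp(s)[Delta f + Delta g] (Assumption 4) for a scalar example.
# The dynamics f, g, the noise bounds and the filter are illustrative assumptions.
import numpy as np
from scipy import signal

f = lambda x, u: -x + u                      # assumed known local dynamics
g = lambda x, z, u: 0.3 * np.sin(z) * x      # assumed known interconnection term
alpha, p = 10.0, 2
Hp = signal.TransferFunction([alpha ** p], (np.poly1d([1.0, alpha]) ** p).coeffs)

t = np.linspace(0.0, 10.0, 4001)
x, z, u = np.sin(t), np.cos(t), 0.5 * np.ones_like(t)    # representative operating trajectory
w_bar, zeta_bar = 0.05, 0.05                             # assumed sensor-noise bounds
rng = np.random.default_rng(0)

eps_delta = 0.0
for _ in range(200):
    w = rng.uniform(-w_bar, w_bar, size=t.size)
    zeta = rng.uniform(-zeta_bar, zeta_bar, size=t.size)
    mismatch = (f(x, u) - f(x + w, u)) + (g(x, z, u) - g(x + w, z + zeta, u))
    _, filt, _ = signal.lsim(Hp, U=mismatch, T=t)
    eps_delta = max(eps_delta, np.max(np.abs(filt)))
print("estimated bound eps_Delta:", eps_delta)
```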

Therefore, the detection threshold becomes

$$\begin{aligned} \bar{r}_I^{(i)}(t) =&\bar{H}_p(s) \left[ \bar{\eta }_I^{(i)}\big (m_I(t),m_{zI}(t),u_I(t)\big ) \right] +\bar{\varepsilon }_{\varDelta _I}^{(i)} +\bar{\varepsilon }_{w_I}^{(i)}. \end{aligned}$$
(27)
Fig. 3 Local filtered fault detection scheme

Figure 3 illustrates the I-th LFD which includes the implementation of the local filtered fault detection scheme for the I-th subsystem resulting from Eqs. (9), (11), (12), (14) and (23).
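
To fix ideas, the following is a minimal end-to-end sketch of the local filtered detection logic of Fig. 3 for a single scalar subsystem without interconnection; the dynamics, uncertainty, fault, bounding constants and filter parameters are illustrative assumptions, and the threshold follows the structure of (27) with a constant uncertainty bound.

```python
# Minimal sketch of the local filtered detection logic (Eqs. (9), (11), (12), (14), (27))
# for a scalar subsystem.  All numerical values are illustrative assumptions.
import numpy as np
from scipy import signal
from scipy.integrate import solve_ivp

f = lambda x, u: -x + u                              # known local dynamics
eta = lambda x: 0.05 * np.sin(2.0 * x)               # unknown modeling uncertainty, |eta| <= 0.05
phi, T0 = 0.8, 5.0                                   # assumed fault magnitude and occurrence time
eta_bar, eps_w_bar, eps_delta_bar = 0.05, 0.05, 0.02 # assumed bounds (e.g., from Monte Carlo)

alpha, p = 10.0, 2
den = (np.poly1d([1.0, alpha]) ** p).coeffs
H = signal.TransferFunction([alpha ** p, 0.0], den)  # H(s) = s * Hp(s)
Hp = signal.TransferFunction([alpha ** p], den)      # Hp(s), nonnegative impulse response

u = lambda t: np.sin(t)
ts = np.linspace(0.0, 10.0, 4001)
plant = lambda t, x: [f(x[0], u(t)) + eta(x[0]) + (phi if t >= T0 else 0.0)]
x = solve_ivp(plant, (0.0, 10.0), [0.0], t_eval=ts, max_step=0.01).y[0]

rng = np.random.default_rng(0)
m = x + rng.uniform(-0.01, 0.01, size=ts.size)       # noisy measurement m = x + w

# Estimation model (11): x_hat_dot = f(m, u), x_hat(0) = m(0)
est = lambda t, xh: [f(np.interp(t, ts, m), u(t))]
xhat = solve_ivp(est, (0.0, 10.0), [m[0]], t_eval=ts, max_step=0.01).y[0]

_, yf, _ = signal.lsim(H, U=m, T=ts)                 # y_f     = H(s)[m]      (Eq. 9)
_, yhf, _ = signal.lsim(H, U=xhat, T=ts)             # y_hat_f = H(s)[x_hat]  (Eq. 12)
r = np.abs(yf - yhf)                                 # residual (Eq. 14)

_, thr, _ = signal.lsim(Hp, U=eta_bar * np.ones_like(ts), T=ts)
r_bar = thr + eps_delta_bar + eps_w_bar              # threshold, structure of Eq. (27)

alarm = ts[np.argmax(r > r_bar)] if np.any(r > r_bar) else None
print("fault declared at t =", alarm)
```

Since the impulse response of this \(H_p(s)\) is nonnegative, the threshold filter \(\bar{H}_p(s)\) is taken equal to \(H_p(s)\), as discussed in Sect. 3.2.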

3.2 Selection of Filter \(\bar{H}_p(s)\)

Two methods for selecting a suitable transfer function \(\bar{H}_p(s)\) with impulse response \(\bar{h}_p(t)\) such that \(| h_p(t) | \le \bar{h}_p(t)\) for all \(t \ge 0\) are illustrated.

In general though, note that if the impulse response \(h_p(t)\) is nonnegative, i.e., \(h_p(t) \ge 0\), for all \(t \ge 0\), then the calculation of \(\bar{H}_p(s)\) can be omitted. In this case \(H_p(s)\) can be used instead of \(\bar{H}_p(s)\) in (23), as it can easily be seen from (19) since \(| h_p(t-\tau ) |=h_p(t-\tau )\). Necessary and sufficient conditions for nonnegative impulse response for a specific class of filters are given in [60].

  • First method.

The first method relies on the following Lemma, which describes a methodology for finding \(\bar{H}_p(s)\). For notational convenience, for any \(m \times n\) matrix A we define \(\left| A \right| _{\mathscr {E}}\) as the matrix whose elements are the moduli of the elements \(a_{i,j}\), \(i=1,\ldots ,m\) and \(j=1,\ldots ,n\), of the matrix A.

Lemma 1

([48]). Let \(w(t) = C e^{At} B\) be the impulse response of a strictly proper SISO transfer function W(s) with state space representation (ABC). Then, for any signal \(v(t) \ge 0\), the following inequality holds for all \(t \ge 0\):

$$\begin{aligned} \int _{0}^t | w (t-\tau ) | v(\tau ) \,\mathrm {d}\tau \le \overline{W}(s) \left[ v(t) \right] , \end{aligned}$$

where \(\overline{W}(s)\) is given by

$$\begin{aligned} \overline{W}(s) \triangleq \left| CT \right| _{\mathscr {E}} (sI-\text {Re}[J])^{-1} \left| T^{-1}B \right| _{\mathscr {E}} \end{aligned}$$
(28)

and \(J=T^{-1}AT\) is the Jordan form of the matrix A.

Therefore, by using Lemma 1 with \(w(t)=h_p(t)\), the transfer function \(\bar{H}_p(s)\) such that its impulse response satisfies \(| h_p(t) | \le \bar{h}_p(t)\) can be obtained from (28).
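
A possible numerical realization of this construction, under the simplifying assumption that A is diagonalizable (so the Jordan form reduces to the diagonal matrix of eigenvalues and T to the eigenvector matrix), is sketched below for an assumed filter.

```python
# Sketch of Lemma 1 for a diagonalizable A: build W_bar(s) = |C T| (sI - Re[J])^{-1} |T^{-1} B|
# as a state-space system.  The example filter is an illustrative assumption.
import numpy as np
from scipy import signal

Hp = signal.TransferFunction([8.0], [1.0, 3.0, 8.0])   # assumed Hp(s) with complex poles
ss = Hp.to_ss()
A, B, C = ss.A, ss.B, ss.C

eigvals, T = np.linalg.eig(A)                          # A = T J T^{-1}, J = diag(eigvals)
J_re = np.diag(eigvals.real)                           # Re[J]
B_bar = np.abs(np.linalg.solve(T, B))                  # |T^{-1} B| (element-wise modulus)
C_bar = np.abs(C @ T)                                  # |C T|      (element-wise modulus)

W_bar = signal.StateSpace(J_re, B_bar, C_bar, np.zeros((1, 1)))
# W_bar(s)[v(t)] then upper-bounds the convolution of |hp| with any v(t) >= 0, as in the lemma.
```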

  • Second method.

The second method uses the following well-known result (see, for instance, [24]).

Lemma 2

The impulse response \(h_p(t)\) of a strictly proper and asymptotically stable transfer function \(H_p(s)\) decays exponentially; i.e., \(| h_p(t) | \le \kappa e^{-\upsilon t}\) for some \(\kappa >0\), \(\upsilon >0\), for all \(t \ge 0\).

By using Lemma 2, a suitable impulse response \(\bar{h}_p(t)\) such that \(| h_p(t) | \le \bar{h}_p(t)\) for all \(t \ge 0\) is given by \(\bar{h}_p(t)=\kappa e^{-\upsilon t}\) and can be implemented using linear filtering techniques as \(\bar{H}_p(s) = \frac{\kappa }{s+\upsilon }\).
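
In practice, the constants \(\kappa \) and \(\upsilon \) can be fitted numerically; the following sketch (for an assumed filter) picks \(\upsilon \) slightly below the decay rate of the slowest pole and then computes the smallest \(\kappa \) that is valid on a sampled time grid.

```python
# Sketch of the second method: fit kappa, upsilon so that |hp(t)| <= kappa * exp(-upsilon t),
# then use H_bar_p(s) = kappa/(s + upsilon).  The example filter is an assumption.
import numpy as np
from scipy import signal

Hp = signal.TransferFunction([8.0], [1.0, 3.0, 8.0])
t = np.linspace(0.0, 20.0, 4001)
_, hp = signal.impulse(Hp, T=t)

poles = np.roots(Hp.den)
upsilon = 0.9 * min(-poles.real)                     # decay rate slightly below the slowest pole
kappa = np.max(np.abs(hp) * np.exp(upsilon * t))     # smallest kappa valid on the sampled grid
H_bar_p = signal.TransferFunction([kappa], [1.0, upsilon])
print("kappa =", kappa, " upsilon =", upsilon)
```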

3.3 Fault Detectability and Detection Time Analysis

3.3.1 Fault Detectability Analysis

The design and analysis of the fault detection scheme in the previous sections were based on the derivation of suitable thresholds \(\bar{r}_I^{(i)}(t)\) such that in the absence of any fault, the residual signals \(r_I^{(i)}(t)\) are bounded by \(\bar{r}_I^{(i)}(t)\). An important related question is what class of faults can be detected. This is referred to as fault detectability analysis. In this section, fault detectability conditions for the aforementioned fault detection scheme are derived. The fault detectability analysis constitutes a theoretical result that characterizes quantitatively the class of faults detectable by the proposed scheme.

Theorem 1

Consider the nonlinear system (1), (2) with the distributed fault detection scheme described in (9), (11), (12), (14) and (23) in the general case of H(s) given by (6). A sufficient condition for a fault \(\phi _I^{(i)}(x_I,z_I,u_I)\) in the I-th subsystem initiated at \(T_0\) to be detectable at time \(T_d>T_0\) is that for some \(i=1,2,\ldots ,n_I\):

$$\begin{aligned} | H_p(s) \big [ \beta _I(T_d-T_0)\phi _I^{(i)}\big (x_I(T_d),z_I(T_d),u_I(T_d)\big ) \big ] | >2\bar{r}_I^{(i)}(T_d). \end{aligned}$$
(29)

Proof

In the presence of a fault that occurs at \(T_0\), Eq. (15) becomes

$$\begin{aligned} r_I^{(i)}(t) =&H_p(s) \big [ \chi _I^{(i)}(t) + \beta _I(t-T_0)\phi _I^{(i)}\big (x_I(t),z_I(t),u_I(t)\big ) \big ] +\varepsilon _{w_I}^{(i)}(t). \end{aligned}$$

By using the triangle inequality, for \(t>T_0\), the residual \(r_I^{(i)}(t)\) satisfies

$$\begin{aligned} | r_I^{(i)}(t) | \ge&- | H_p(s) \big [ \chi _I^{(i)}(t) \big ] | - | \varepsilon _{w_I}^{(i)}(t) | \\&+ | H_p(s) \big [ \beta _I(t-T_0)\phi _I^{(i)}\big (x_I(t),z_I(t),u_I(t)\big ) \big ] |\\ \ge&- \int _{0}^t | h_p (t-\tau ) | | \chi _I^{(i)}(\tau ) | \,\mathrm {d}\tau - | \varepsilon _{w_I}^{(i)}(t) | \\&+ | H_p(s) \big [ \beta _I(t-T_0)\phi _I^{(i)}\big (x_I(t),z_I(t),u_I(t)\big ) \big ] |\\ \ge&- \int _{0}^t \bar{h}_p (t-\tau ) \bar{\chi }_I^{(i)}(\tau ) \,\mathrm {d}\tau - \bar{\varepsilon }_{w_I}^{(i)} \\&+ | H_p(s) \big [ \beta _I(t-T_0)\phi _I^{(i)}\big (x_I(t),z_I(t),u_I(t)\big ) \big ] |\\ \ge&- \bar{r}_I^{(i)}(t) + | H_p(s) \big [ \beta _I(t-T_0)\phi _I^{(i)}(x_I(t),z_I(t),u_I(t)) \big ] |. \end{aligned}$$

For fault detection, the inequality \(| r_I^{(i)}(t) | > \bar{r}_I^{(i)}(t)\) must hold at some time \(t=T_d\) for some \(i=1,\ldots ,n_I\), so the final fault detectability condition given by (29) is obtained.    \(\square \)

Although Theorem 1 is based on threshold (23), it can be readily shown that the same result holds in the case where threshold (27) is used. Clearly, the fault functions \(\phi _I(x_I,z_I,u_I)\) are typically unknown and therefore this condition cannot be checked a priori. However, it provides useful intuition about the type of faults that are detectable. The detectability condition given in Theorem 1 is a sufficient condition, but not a necessary one and hence, the class of detectable faults can be significantly larger. The use of filtering is of crucial importance in order to derive tighter detection thresholds that guarantee no false alarms. As can be seen in the detectability condition given by (29), the detection of the fault depends on the filtered fault function \(\phi _I\) and, as a result, the selection of the filter is very important. Since the fault function usually comprises lower frequency components, it is affected by low-pass filtering much less than the measurement noise, which is usually of higher frequency. In addition, filtering allows the derivation of tighter detection thresholds and, as a result, the fault detectability condition can be met more easily. Obviously, some filter selections may lead to less conservative thresholds than others.

The detectability properties of the proposed filtering approach are further investigated by considering a specific case for the filter \(H_p(s)\)

$$\begin{aligned} H_p(s)=\frac{\alpha ^p}{(s+\alpha )^p}. \end{aligned}$$
(30)

This type of filter is well suited for gaining further intuition since it contains two parameters p and \(\alpha \) that denote the order of the filter and the pole location, respectively. More specifically, the order p of the filter regulates the damping effect of the high frequency noise, whereas the value \(\alpha \) of the filter determines the cutoff frequency at which the damping begins. In general, more selective filter implementations can be used (e.g., Butterworth filters), which may have some implications for the filters required in the detection threshold implementation (due to the fact that the impulse response may not always be positive). However, the particular filter \(H_p(s)\) given by (30) is perfectly suited for the investigation of the analytical properties of the filtering scheme. Note also that \(H_p(s)\) has a nonnegative impulse response \(h_p(t)\) and therefore \(\bar{H}_p(s)\) can be selected simply as \(H_p(s)\).
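
Indeed, the impulse response of (30) has the standard closed form (a well-known inverse Laplace transform pair, stated here for completeness)

$$\begin{aligned} h_p(t) = \mathscr {L}^{-1}\left[ \frac{\alpha ^p}{(s+\alpha )^p}\right] = \frac{\alpha ^p \, t^{p-1}}{(p-1)!} \, e^{-\alpha t}, \quad t \ge 0, \end{aligned}$$

which is nonnegative for all \(t \ge 0\).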

In order to conduct this fault detectability analysis, we simplify Assumption 2 by considering a constant bounding condition. It is important to note that the constant bounding of the uncertainty may introduce additional conservativeness, thus reducing the advantage given by the tighter conditions obtained through the filtering.

Assumption 5

The modeling uncertainty \(\eta _I^{(i)}\) in each subsystem is an unstructured and possibly unknown nonlinear function of \(x_I\), \(z_I\) and \(u_I\) but uniformly bounded by a known positive scalar \(\bar{\eta }_I^{(i)}\), i.e.,

$$\begin{aligned} | \eta _I^{(i)}(x_I,z_I,u_I) | \le \bar{\eta }_I^{(i)}, \quad i=1,2,\ldots ,n_I \end{aligned}$$
(31)

for all \(t \ge 0\) and for all \((x_I,z_I,u_I) \in \mathscr {D}_I\), where \(\bar{\eta }_I^{(i)} \ge 0\) is a known bounding scalar in some region of interest \(\mathscr {D}_I = \mathscr {D}_{x_I} \times \mathscr {D}_{z_I} \times \mathscr {D}_{u_I} \subset \mathbb {R}^{n_I} \times \mathbb {R}^{\bar{n}_I} \times \mathbb {R}^{l_I}\). \({} \square \)

By using the Lipschitz assumption stated in (24), along with the known constant bound \(w_I^M\) of the measurement uncertainty \(| w_I |\) and the constant bound on the modeling uncertainty \(\bar{\eta }_I^{(i)}\), as stated in Assumption 5, the bound of the total uncertainty term \(\bar{\chi }_I^{(i)}(t)\) takes a constant value \(\bar{\chi }_I^{(i)}\). Then, Theorem 2, which follows, can be obtained (its proof can be found in [48]).

It must be pointed out that, although we use (23) for the detection threshold, the subsequent results can be straightforwardly adapted to the case where the threshold is given by (27), by simply replacing \(\bar{\chi }_I^{(i)}\) with \(\bar{\eta }_I^{(i)}\) and adding the term \(\bar{\varepsilon }_{\varDelta _I}^{(i)}\) alongside the term \(\bar{\varepsilon }_{w_I}^{(i)}\) in what follows.

Theorem 2

Consider the nonlinear system (1), (2) with the distributed fault detection scheme described in (9), (11), (12), (14) and (23) in the special case of \(H_p(s)\) given by (30) and with \(\bar{H}_p(s)=H_p(s)\). Suppose at least one component \(\phi _I^{(i)}(x_I,z_I,u_I)\) of the fault vector \(\phi _I(x_I,z_I,u_I)\) satisfies the condition

$$\begin{aligned} | \phi _I^{(i)}(x_I(t^\prime ),z_I(t^\prime ),u_I(t^\prime )) | \ge M, \quad \forall t^\prime \in \left[ T_0,t \right] , \end{aligned}$$
(32)

for sufficiently large \(t>T_0\) and is continuous in the time interval \(t^\prime \in \left[ T_0,t \right] \). If \(M > 2 ( \bar{\chi }_I^{(i)} + \bar{\varepsilon }_{w_I}^{(i)})\), then the fault will be detected, that is \(| r_I^{(i)}(t) | > \bar{r}_I^{(i)}(t)\).

The aforementioned theorem is conceptually different from Theorem 1. More specifically, the detectability condition (29) of Theorem 1 allows the fault function \(\phi _I^{(i)}\) to change sign. On the other hand, Theorem 2 states that if the fault function \(\phi _I^{(i)}\) maintains the same sign over time and its magnitude is larger than \(2 (\bar{\chi }_I^{(i)} + \bar{\varepsilon }_{w_I}^{(i)} )\) for sufficiently long, then the fault is guaranteed to be detected.

3.3.2 Detection Time Analysis

The detection time of a fault, that is, the time interval between the fault occurrence and its detection, plays a crucial role in fault diagnosis and constitutes a form of performance criterion. When a fault is detected faster, timely actions can be undertaken to avoid more serious or even disastrous consequences. It is worth noting that incipient faults are more difficult to detect, especially during their early stages, and as a result the detection time of an incipient fault is generally larger than that of an abrupt fault. In this section, an upper bound on the detection time is obtained in the case where a fault is detected according to Theorem 2. Moreover, we investigate the influence of the filter's order p and the pole location \(\alpha \) on this upper bound in order to derive some insight regarding the selection of p and \(\alpha \). The results are obtained for the general case of an incipient fault; concerning the dependence of the detection time on the filter's order p, only the abrupt fault case is addressed for the sake of simplicity.

Theorem 3

Consider the nonlinear system (1), (2) with the distributed fault detection scheme described in (9), (11), (12), (14) and (23) in the special case of \(H_p(s)\) given by (30) and with \(\bar{H}_p(s)=H_p(s)\). If at least one component \(\phi _I^{(i)}(x_I,z_I,u_I)\) of the fault vector \(\phi _I(x_I,z_I,u_I)\) satisfies the condition

$$\begin{aligned} \left| \phi _I^{(i)}\big (x_I(t^\prime ),z_I(t^\prime ),u_I(t^\prime )\big ) \right| \ge M, \quad \forall t^\prime \in \left[ T_0,t \right] \end{aligned}$$
(33)

where \(M> 2 (\bar{\chi }_I^{(i)} + \bar{\varepsilon }_{w_I}^{(i)})\), for sufficiently large \(t>T_0\), and \(\phi _I^{(i)}\) is continuous in the time interval \(t^\prime \in \left[ T_0,t \right] \), such that the fault can be detected according to Theorem 2, then:

  1. (a)

    A sufficient condition for fault detectability is given by

    $$\begin{aligned} q(t,T_0,\alpha ) > \frac{2 (p-1)! }{M} (\bar{\chi }_I^{(i)} + \bar{\varepsilon }_{w_I}^{(i)}). \end{aligned}$$
    (34)

    where

    $$\begin{aligned} q(t,T_0,\alpha )&\triangleq q_1(t,T_0,\alpha )-q_2(t,T_0,\alpha ) \end{aligned}$$
    (35)
    $$\begin{aligned} q_1(t,T_0,\alpha ) = \gamma \big ( p,\alpha (t-T_0) \big ), \end{aligned}$$
    (36)
    $$\begin{aligned} q_2(t,T_0,\alpha )= {\left\{ \begin{array}{ll} \frac{\alpha ^p}{p} (t-T_0)^p e^{-\alpha (t-T_0)} &{} \text{ if } \alpha =b_I, \\ \frac{\alpha ^p e^{-b_I(t-T_0)}}{(\alpha -b_I)^p}\, \gamma \big ( p,(\alpha -b_I)(t-T_0) \big ) &{} \text{ otherwise, } \end{array}\right. } \end{aligned}$$
    (37)

    and \(\gamma ( \cdot )\) indicates the lower incomplete Gamma function, defined as \(\gamma \big ( p,z \big ) \triangleq \int _{0}^z w^{p-1} e^{-w} \,\mathrm {d}w\).

  2. (b)

    An upper bound on the detection time \(T_d\) of an incipient fault can be found by solving the equation

    $$\begin{aligned} q_1(T_d,T_0,\alpha )-q_2(T_d,T_0,\alpha ) = \frac{2 (p-1)! }{M} \bar{r}_I^{(i)}(T_d), \end{aligned}$$
    (38)

    where \(\bar{r}_I^{(i)}\) is given by

    $$\begin{aligned} \begin{aligned} \bar{r}_I^{(i)}(t) = \frac{1}{(p-1)!} \bar{\chi }_I^{(i)} \gamma \big (p,\alpha t \big ) + \bar{\varepsilon }_{w_I}^{(i)}. \end{aligned} \end{aligned}$$
    (39)
  3. (c)

    The upper bound \(T_d\) decreases monotonically as the value of \(\alpha \) increases.

  4. (d)

    In the case of abrupt faults, the upper bound on the detection time \(T_d\) increases as the order p of the filter increases.

The proof of Theorem 3 can be found in [48]. Part (b) of the above theorem establishes the mathematical equation whose solution gives an upper bound on the detection time. At this point, we must stress that, although we refer to the solution of the equation as the upper bound of the detection time (because of the requirement (32)), there are cases where the solution is the actual detection time. For instance, consider the case where the magnitude of the fault is \( \big | \phi _I^{(i)}(x_I(t^\prime ),z_I(t^\prime ),u_I(t^\prime )) \big |= M, \quad \forall t^\prime \in \left[ T_0,t \right] \) and \(M > 2(\bar{\chi }_I^{(i)} + \bar{\varepsilon }_{w_I}^{(i)})\). Then, the solution of (38) gives the actual detection time.
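To illustrate Part (b), the following hedged numerical sketch solves Eq. (38) for the detection-time upper bound \(T_d\) in the special case of the filter (30). All numerical values (the order p, the pole \(\alpha\), the parameter \(b_I\) appearing in (37), the uncertainty bounds and the fault magnitude M) are assumed example values, not values from the chapter; SciPy's gammainc is the regularized lower incomplete gamma, so it is rescaled by \(\Gamma(p)\).

```python
# A hedged numerical sketch (example values, not from the chapter) of solving
# Eq. (38) for the detection-time upper bound T_d, for the filter (30).
import numpy as np
from math import factorial
from scipy.special import gammainc, gamma
from scipy.optimize import brentq

p, alpha = 3, 2.0             # filter order and pole (assumed example values)
b_I = 0.5                     # incipient-fault rate parameter appearing in (37)
T0 = 1.0                      # fault occurrence time
chi_bar, eps_w = 0.05, 0.02   # constant uncertainty bounds (assumed)
M = 2.5 * (chi_bar + eps_w)   # fault magnitude; must exceed 2*(chi_bar + eps_w)

def lig(a, z):
    """Lower incomplete gamma function gamma(a, z)."""
    return gammainc(a, z) * gamma(a)

def q1(t):
    return lig(p, alpha * (t - T0))

def q2(t):
    d = t - T0
    if np.isclose(alpha, b_I):
        return alpha**p / p * d**p * np.exp(-alpha * d)
    return alpha**p * np.exp(-b_I * d) / (alpha - b_I)**p * lig(p, (alpha - b_I) * d)

def r_bar(t):
    return chi_bar * lig(p, alpha * t) / factorial(p - 1) + eps_w   # Eq. (39)

def residual_eq(t):
    # Left-hand side minus right-hand side of Eq. (38)
    return q1(t) - q2(t) - 2.0 * factorial(p - 1) / M * r_bar(t)

# residual_eq is negative just after T0 and becomes positive if the fault is
# detectable, so the sign change can be bracketed and solved for T_d.
Td = brentq(residual_eq, T0 + 1e-6, T0 + 200.0)
print(f"Upper bound on the detection time: T_d = {Td:.3f}  (delay {Td - T0:.3f})")
```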

Part (c) of the theorem shows that increasing the value of the pole \(\alpha \) decreases the upper bound on the detection time (and, in the cases explained above, the actual detection time). On the other hand, \(\alpha \) also sets the cutoff frequency at which the noise damping begins, so the pole location involves an inherent trade-off between noise damping and fault detection speed.

Part (d) of the theorem states that, in the case of abrupt faults, the upper bound on the detection time increases as the order p of the filter increases. Although the proof covers only abrupt faults, the same behavior is observed for incipient faults as well. An obvious downside of higher order filtering is therefore a possibly increased detection time. There is also a qualitative explanation for Part (d), related to the phase lag introduced by the filter, which increases with p. Simply put, increasing p results in an increased phase lag, or delay, between the input and output signals of the filter; since the detectability of a fault relies on the filtered signals, the detection time increases according to the delay incurred.

Remark 1

Prior to the occurrence of a fault, the residual differs from zero due to the effect of the filtered noise and the filtered modeling uncertainty, as indicated by (15). When a fault occurs, the residual is permanently contaminated by the filtered fault function, as shown in the proof of Theorem 1. In general, the location of the poles simply affects the effectiveness of the noise damping. To make this clearer, consider Theorems 2 and 3, which rely on the special case of the filter \(H_p(s)\) given in (30). Theorem 2 states that a fault (abrupt or incipient) satisfying the conditions given in the theorem is guaranteed to be detected, irrespective of the location of the filter's poles. In fact, as shown in Theorem 3, faster poles result in a smaller upper bound on the detection time, or even a smaller actual detection time. In conclusion, the location of the poles does not limit the duration of the residual activation when a fault occurs; the residual is permanently affected by the filtered fault function. The pole location instead involves an inherent trade-off between noise damping and fault detection speed.

Simulation results showing the effectiveness of the illustrated techniques can be found in [48].

4 The Cyber-Physical Networked Architecture

In this section, we present a cyber-physical networked fault detection architecture based on [14]. Let us note that the approach for distributed fault diagnosis of nonlinear uncertain large-scale systems that we have previously described is based on some underlying assumptions that may restrict its applicability, namely:

  1. 1.

    global synchronization: subsystems, sensors, and LFDs were assumed to share the same clock and sampling frequency;

  2. 2.

    perfect information exchange: it was assumed that information exchanged between LFDs and communicated from the system to the LFDs is free of errors and delays and is immediately available at any point of the diagnosis system.

In several realistic contexts, (1) and (2) may not hold, and as a consequence, (i) some faults may become undetectable due to the fact that LFDs make detection decisions based on outdated information; (ii) delays in information exchange may cause longer detection times; (iii) the lack of accurate and timely information may cause false alarms.

In order to address these issues and the more complex nature of real CPSs, we now consider a more comprehensive framework, in which the previously proposed filtering design for measurement-noise reduction is adapted to a discrete-time formulation.

The proposed distributed fault detection architecture is made of three layers: the system layer, the sensor layer, and the diagnosis layer. This layout is shown pictorially in Fig. 2. These three layers are briefly described next.

The system layer refers to the large-scale system to be monitored. It is described by the continuous-time state equations for each subsystem Eq. (1) and the output Eq. (2).

The sensor layer consists of the available sensors taking measurements \(m_I^{(i)}(t)\) in continuous-time (see (5)) and sampling and sending such measurements to the I-th LFD at time instants \(t_{sI}^{(i)}\) that are not necessarily equally spaced in time. As we do not assume that the measurements delivered by the sensors are synchronized with each other, each measurement is labeled with a Time Stamp (TS) [94] to indicate the time instant \(t_{sI}^{(i)}\) at which the measurements are taken by sensor \(S_I^{(i)}\) in the time coordinate t.

The communication between the sensors and the LFDs is achieved through the first level communication network (see Fig. 2). This network can introduce delays and packet losses, for instance because of collision between different sensors trying to communicate at the same time. Therefore, measurements communicated from the sensors to LFDs may be received at any time instant.

The diagnosis layer consists of the previously introduced LFDs providing a distributed fault diagnosis procedure. The structure of each LFD is shown in Fig. 4. As previously mentioned, each LFD receives the measurements from specific sensors with the aim of providing local fault diagnosis decisions. The LFDs operate in a discrete-time synchronous time frame \(k\in \mathbb {Z}\), which turns out to be more convenient for handling any communication delays, as will be seen in the next sections. For the sake of simplicity, the sampling time of the discrete-time frame is assumed to be unitary and the reference time is common, that is, the origin of the discrete-time axis is the same as that of the continuous-time axis. Therefore, the operation of the LFDs is based on the local discrete-time models, which are the discrete-time version of the local models (1):

$$\begin{aligned} \begin{aligned} x_I(k+1)=f_I(x_I(k),u_I(k))+g_I(x_I(k),&z_I(k),u_I(k)) + \eta _I(x_I(k),z_I(k),u_I(k)) \\&+ \beta _I(k-k_0)\phi _I(x_I(k),z_I(k),u_I(k)) \, , \end{aligned} \end{aligned}$$
(40)

where \(\phi _I\) describes the local discretized fault effects, occurring at some discrete time \(k_0\) (that is, \(\beta _I(k-k_0)\phi _I(x_I(k),z_I(k),u_I(k))=0\) for \(k<k_0\)). Each LFD exchanges information with neighboring LFDs by means of the second level communication network (see right side of Figs. 2 and 4). As we will see in the following, the exchanged information consists of the re-synchronized interconnection variables \(v_J\). In Fig. 4, an example of a two-LFD architecture is presented to provide more insight into the structure of the proposed scheme.

Fig. 4

An example of a two-LFD architecture. The internal structure of each LFD is shown (similarly to [14]), composed of two buffers (the measurements buffer and the diagnosis buffer) collecting the information received, respectively, from the local sensors and from neighboring LFDs, the Virtual Sensor (processing the received measurements), and the Fault Detection unit, responsible for the monitoring analysis. The information communicated between LFDs is also represented

In summary, two different, unreliable communication networks are considered in this work: the first level communication network allows each LFD to communicate with its local sensors, while the second level communication network allows the communication between different LFDs for detection purposes. Both communication networks may be subject to delays and packet losses. Given the different nature of the networks (the first is local, while the second connects different subsystems, which may be geographically far apart), in the next section we provide two different strategies to manage communication issues: a re-synchronization method for the first level communication network and a delay compensation strategy for the second level communication network.

4.1 Re-Synchronization at Diagnosis Level

Let us consider a state variable \(x_I^{(i)}(t)\); as mentioned before, at time \(t=t_{sI}^{(i)}\) the sensor \(S_I^{(i)}\) takes the measurement \( m_I^{(i)}(t_{sI}^{(i)})\) and sends it to the I-th LFD with the time stamp \(t_{sI}^{(i)}\). The I-th diagnoser receives the measurement sent by \(S_I^{(i)}\) at time \(t_{aI}^{(i)} > t_{sI}^{(i)}\). Since the LFDs run the distributed fault diagnosis algorithm with respect to a discrete-time framework associated with an integer k (see (40)), an online re-synchronization procedure has to be carried out at the diagnosis level. Moreover, the possible time-varying delays and packet losses introduced by the communication networks between the local sensors and the corresponding LFDs have to be addressed, since they may affect the fault diagnosis decision. Note that the classical discrete-time FD architecture assumes that quantities sampled exactly at time k are used to compute quantities related to time \(k+1\). Unfortunately, the LFDs may receive measurements associated with time instants different from k, because of transmission delays and because of the arbitrary sampling time instants of the sensors. The availability of the time stamp \(t_{sI}^{(i)}\) enables each LFD to implement a set of local virtual sensors carrying out the re-synchronization of the measurements received at the diagnosis level. We assume that sensors and diagnosers share the same clock at the local level.Footnote 1

Specifically, each LFD collects the most recent sensors measurements in a buffer and computes a projection \(\hat{m}_I^{(i)}(k|t_{sI}^{(i)})\) of these latest available measurements \({m}_I^{(i)}(t_{sI}^{(i)})\), \(i=1,\dots ,n_I\), to the discrete-time instantFootnote 2 \(k \ge t_{aI}^{(i)}> t_{sI}^{(i)}\), by integrating the local nominal model on the time interval \([t_{sI}^{(i)},k]\).

Remark 2

Let us note that measurements may refer to, and could also be received, before time \(k-1\), without any assumption on the delay length, thus allowing for the possibility of measurement packet losses. Moreover, thanks to the use of the time stamps and the buffers, “out-of-sequence” packets can be managed. The same measurement could be used by the virtual sensor more than once to obtain more than one projection related to different discrete-time instants.

The projected measurement \(\hat{m}_I^{(i)}(k|t_{sI}^{(i)})\) can be computed by noticing that, under healthy mode of behavior, the local nominal model (1) for the state component i at any time \(t>t_{sI}^{(i)}\) can be rewritten as

$$\begin{aligned} x_I^{(i)}(t)=x_I^{(i)}(t_{sI}^{(i)})&+\int _{t_{sI}^{(i)}}^{t} [ f_I^{(i)}(x_I(\tau ),u_I(\tau ))+g_I^{(i)}(x_I(\tau ),z_I(\tau ),u_I(\tau ))\\&+\eta _I^{(i)}(x_I(\tau ),z_I(\tau ),u_I(\tau ))] d\tau \, . \end{aligned}$$

Hence, the LFD implements a virtual sensor that generates an estimate of the measurement at discrete-time k given by

$$\begin{aligned} \begin{aligned} \hat{m}_I^{(i)}(k|t_{sI}^{(i)})&=m_I^{(i)}(t_{sI}^{(i)}) \\&\quad +\int _{t_{sI}^{(i)}}^{k} \big [ f_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),u_I(\tau ))+{g}_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),\hat{m}_{zI}(\tau |t_{sI}^{(i)}),u_I(\tau )) \\&\quad +\hat{\eta }_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),\hat{m}_{zI}(\tau |t_{sI}^{(i)}),u_I(\tau )) \big ] \, d\tau \, , \end{aligned} \end{aligned}$$
(41)

where \(\hat{\eta }_I\) denotes an adaptive approximator designed to learn the unknown modeling uncertainty function \(\eta _{I}\) [27] and \(\hat{m}_{zI}\) are the projections of the measured interconnection variables \(m_{zI}\). An example illustrating the re-synchronization procedure for one LFD monitoring a subsystem with three state variables is shown in Fig. 5.
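A minimal sketch of the virtual-sensor projection (41) for a scalar state component is given below; the local functions f_I and g_I, the stubbed approximator eta_hat, the input trajectory and the interconnection projection are hypothetical placeholders introduced only for illustration.

```python
# A minimal sketch of the virtual-sensor projection (41) for one LFD, assuming
# a hypothetical scalar local model f_I, interconnection term g_I and a (here
# trivial) uncertainty approximator eta_hat; none of these come from the chapter.
import numpy as np
from scipy.integrate import solve_ivp

def f_I(x, u):          return -0.5 * x + u      # hypothetical local dynamics
def g_I(x, z, u):       return 0.1 * z           # hypothetical interconnection term
def eta_hat(x, z, u):   return 0.0               # adaptive approximator output (stub)

def project_measurement(m_ts, t_s, k, u_of_t, mz_of_t):
    """Project the time-stamped measurement m_I(t_s) to the discrete instant k
    by integrating the local nominal model on [t_s, k], as in Eq. (41)."""
    def rhs(t, m):
        u, z = u_of_t(t), mz_of_t(t)
        return [f_I(m[0], u) + g_I(m[0], z, u) + eta_hat(m[0], z, u)]
    sol = solve_ivp(rhs, (t_s, k), [m_ts], max_step=0.05)
    return sol.y[0, -1]        # \hat m_I(k | t_s)

# Example: a measurement taken at t_s = 3.4 is re-synchronized to k = 4
u_of_t  = lambda t: np.sin(t)        # known local input trajectory (assumed)
mz_of_t = lambda t: 0.2              # projected interconnection measurement (assumed)
m_hat_k = project_measurement(m_ts=1.0, t_s=3.4, k=4.0, u_of_t=u_of_t, mz_of_t=mz_of_t)
print(f"projected measurement at k = 4: {m_hat_k:.4f}")
```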

Fig. 5

The re-synchronization procedure [14] needed to manage delays and packet losses in the communication networks between each LFD and its local sensors. A single LFD is considered whose local model depends on three variables, which are measured by three different sensors. The clock signals of each layer involved are shown

Remark 3

It is worth noting that the discrete-time index \(k\in \mathbb {Z}\) acts as a kind of “virtual Time Stamp” (vTS), computed by the LFDs after the re-synchronization task and communicated over the second level communication network between LFDs. This will be exploited in Sect. 4.2.

Remark 4

Although in (41), for analysis purposes, \(\hat{\eta }_I\) represents the output of a continuous-time adaptive approximator, for implementation reasons, a suitable discrete-time approximator will be used, designed as explained in Sect. 4.4.

The above-described projection and re-synchronization procedure gives rise to an additional source of measurement uncertainty: the virtual measurement error, which is defined as

$$ \xi _I^{(i)}(k) \triangleq \hat{m}_I^{(i)}(k|t_{sI}^{(i)})-x_I^{(i)}(k). $$

For the sake of analysis, it is worth noting that, due to synchronization and measurement noise, the virtual measurement error is given by

$$\begin{aligned} \begin{aligned} \xi _I^{(i)}(k)&= m_I^{(i)}(t_{sI}^{(i)})-x_I^{(i)}(t_{sI}^{(i)}) \\&{\quad } + \int _{t_{sI}^{(i)}}^{k} [ \varDelta _{synch} f_I^{(i)}(\tau )+\varDelta _{synch} g_I^{(i)}(\tau ) +\varDelta _{synch} \eta _I^{(i)}(\tau )] d\tau \\&= w_I^{(i)}(t_{sI}^{(i)})+\int _{t_{sI}^{(i)}}^{k} [ \varDelta _{synch} f_I^{(i)}(\tau )+\varDelta _{synch} g_I^{(i)}(\tau ) +\varDelta _{synch} \eta _I^{(i)}(\tau )] d\tau \, , \end{aligned} \end{aligned}$$
(42)

where

$$\begin{aligned} \varDelta _{synch} f_I^{(i)}(\tau ) \triangleq f_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),u_I(\tau )) -f_I^{(i)}(x_I(\tau ),u_I(\tau )) \, , \end{aligned}$$
$$\begin{aligned} \varDelta _{synch} g_I^{(i)}(\tau ) \triangleq {g}_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),\hat{m}_{zI}(\tau |t_{sI}^{(i)}),u_I(\tau )) - g_I^{(i)}(x_I(\tau ),z_I(\tau ),u_I(\tau )) \, , \end{aligned}$$

and

$$\begin{aligned} \varDelta _{synch} \eta _I^{(i)}(\tau ) \triangleq \hat{\eta }_I^{(i)}(\hat{m}_I(\tau |t_{sI}^{(i)}),\hat{m}_{zI}(\tau |t_{sI}^{(i)}),u_I(\tau )) - \eta _I^{(i)}(x_I(\tau ),z_I(\tau ),u_I(\tau )) \, . \end{aligned}$$

For notational convenience, we now collect the projected measurements \(\hat{m}_I^{(i)}(k|t_{sI}^{(i)})\) in a vector, which, in the following, we denote as \(y_I(k)\), with k being its vTS:

$$ y_I(k)=\mathrm{col} \, \left\{ \hat{m}_I^{(i)}(k|t_{sI}^{(i)}), i=1,\dots ,n_I\right\} \, . $$

Therefore, it is as if the virtual sensor implemented by the LFDs takes uncertain local measurements \(y_I\) of the state \(x_I\), according to

$$ y_I(k)=x_I(k)+\xi _I(k), $$

where \(\xi _I\) is the unknown virtual measurement error (42). Moreover, in place of the interconnection variables \(z_I\), only the vector

$$\begin{aligned} v_I(k)=z_I(k)+\varsigma _I(k) \end{aligned}$$

is available for diagnosis, as can be seen in Fig. 6, where \(\varsigma _I\) is composed of the components of \(\xi _J\) affecting the relevant components of \(y_J\) (as before, J refers to a neighboring subsystem). For simplicity, we assume here that the control signal \(u_I\) is available to the diagnoser without any delay or other uncertainty.

Fig. 6

An example of the multi-layer fault detection architecture. The interconnection variables \(z_I\) and the corresponding projected measurements \(v_I\) communicated among the diagnosers

The virtual measurement errors \(\xi _I\) and \(\varsigma _I\) are unstructured and unknown. For each \(i=1,\dots ,n_I\) and \(j=1,\dots ,\bar{n}_I\), it is possible to compute a bound on their components using (42):

$$ \left| \xi _I^{(i)}(k)\right| \le \bar{\xi }_I^{(i)}(k), \ \ \ \ \ \ \ \ \left| \varsigma _I^{(j)}(k)\right| \le \bar{\varsigma }_I^{(j)}(k), $$

where

$$\begin{aligned} \bar{\xi }_I^{(i)}(k)=\bar{w}_I^{(i)}(t_{sI}^{(i)})+\int _{t_{sI}^{(i)}}^{k} \big [ \bar{\varDelta }_{synch} f_I^{(i)}(\tau )+\bar{\varDelta }_{synch} g_I^{(i)}(\tau )+\bar{\varDelta }_{synch}\eta _I^{(i)}(\tau ) \big ] \, d\tau \end{aligned}$$
(43)

is a positive function, \(\bar{w}_I^{(i)}\) is the one defined in Assumption 3,

$$\begin{aligned} \bar{\varDelta }_{synch} f_I^{(i)}(\tau ) =\max _{x_I\in {\mathscr {R}}^{n_I}}\left| f_I^{(i)}(\hat{m}_I(\tau ),u_I(\tau ))-f_I^{(i)}(x_I(\tau ),u_I(\tau ))\right| , \end{aligned}$$
$$\begin{aligned} \bar{\varDelta }_{synch} g_I^{(i)}(\tau ) =\max _{x_I\in {\mathscr {R}}^{n_I},z_I\in {\mathscr {R}}^{\bar{n}_I}}\left| g_I^{(i)}(\hat{m}_I(\tau ),\hat{m}_{zI}(\tau ),u_I(\tau ))-g_I^{(i)}(x_I(\tau ),z_I(\tau ),u_I(\tau ))\right| , \end{aligned}$$

recalling that the sets \(\mathscr {R}^{n_I}\), \(\mathscr {R}^{\bar{n}_I}\) are the domains of the state and interconnection variables, respectively, and that \(\bar{\varDelta }_{synch} \eta _I^{(i)}(\tau )\) can be computed analogously to (65) (see Sect. 4.6). The bound \(\bar{\varsigma }_I\) is computed with the same procedure by the neighboring subsystems. In the next section, the fault diagnosis procedure is presented.

4.2 The Distributed Fault Detection Methodology

For fault detection purposes, each LFD communicates with neighboring LFDs. It is assumed that the inter-LFD communication is carried out over a packet-switched network, which we call the second level communication network, possibly subject to packet delays and losses. In order to manage delays in this network, the data packets are time-stamped with the virtual Time Stamp, which indicates the time instant the virtual measurements refer to. In this layer, we assume perfect clock synchronization between the LFDs. In this way, all the devices of the monitoring architecture share the same clock, that is, they know the reference time, and the use of Time Stamps remains valid.

Furthermore, we propose to provide each LFD with a buffer to collect the variables sent by neighbors. In the following, we denote with the superscript “b” the most recent value of a variable (or of a communicated function value) in the corresponding buffer of a given LFD; for example, \(v^b_I\) denotes the most recent value of the measured interconnection vector \(v_I\) contained in the buffer of the I-th LFD, while \([f_I(\cdot )]^b\) denotes the most recent value of the function \([f_I(\cdot )]\) in the buffer.

Each LFD computes a nonlinear adaptive estimate \(\tilde{x}_I\) of the associated monitored subsystem state \(x_I\). The local estimator, called Fault Detection Approximation Estimator (FDAE), is based on the local discrete-time nominal model (Eq. (40)). Similarly to what was done in the first part of this chapter (Sect. 3), to dampen the effect of the virtual measurement error \(\xi _I(k)\), each measured variable \(y_I^{(i)}=x_I^{(i)}+\xi _I^{(i)}\) is filtered by H(z), where H(z) is a p-th order, asymptotically stable filter (all poles lie strictly inside the unit circle \(| z |=1\)) with proper transfer function

$$\begin{aligned} H(z)=\frac{d_0 + d_1 z^{-1} + d_2 z^{-2} + \ldots +d_p z^{-p}}{1 +c_1 z^{-1} +\ldots + c_p z^{-p}}. \end{aligned}$$
(44)

Generally, each measured variable \(y_I^{(i)}(k)\) can be filtered by a different filter but, without loss of generality, we consider H(z) to be the same for all the output variables, in order to simplify notation and presentation. In addition, note that the form of H(z) allows both IIR and FIR types of digital filters. The filter H(z) can be written as \(H(z)=zH_p(z)\) where \(H_p(z)\) is the strictly proper transfer function

$$\begin{aligned} H_p(z)=\frac{d_0 z^{-1} + d_1 z^{-2} + d_2 z^{-3} + \ldots +d_p z^{-(p+1)}}{1 +c_1 z^{-1} +\ldots + c_p z^{-p}}. \end{aligned}$$
(45)

Note that the filter \(H_p(z)\) is also asymptotically stable, since it has the same poles as H(z) plus an additional pole at \(z=0\) (inside \(| z |=1\)). Since the filters H(z) and \(H_p(z)\) (with impulse responses h(k) and \(h_p(k)\), respectively) are asymptotically stable, they are also BIBO stable. Therefore, for a bounded virtual measurement error \(\xi _I(k)\), the filtered virtual measurement errorFootnote 3 \(\varXi _I(k) \triangleq H(z)\left[ \xi _I(k)\right] \) is bounded as follows:

$$\begin{aligned} \left| \varXi _I^{(i)}(k) \right| \le \bar{\varXi }_I^{(i)}(k) \quad i=1,\ldots ,n_I \end{aligned}$$
(46)

where \(\bar{\varXi }_I^{(i)}\) are bounding functions that can be computed as \(\bar{\varXi }_I^{(i)} \triangleq \bar{H}(z) [\bar{\xi }_I^{(i)}]\), with \(\bar{H}(z)\) being a filter whose impulse response \(\bar{h}(k)\) satisfies \( \left| h(k)\right| \le \bar{h}(k)\), and using Eq. (43). Suitable filters \(\bar{H}(z)\) can be selected by means of the methods indicated in Sect. 4.7. Note that filtered signals are denoted with capital letters.
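The bound (46) can be verified numerically with a small sketch; here the FIR coefficients of H(z) and the error bound sequence are arbitrary example values, and \(\bar{H}(z)\) is obtained simply by taking the absolute values of the coefficients, which is one valid choice for the FIR case discussed in Sect. 4.7.

```python
# A small sketch of the bound (46): for an FIR choice of H(z), the filtered
# virtual measurement error bound is Xibar = Hbar(z)[xibar], where Hbar has the
# absolute values of the coefficients of H(z). All numbers below are examples.
import numpy as np
from scipy.signal import lfilter

d = np.array([0.4, 0.3, -0.2, 0.1])       # FIR coefficients of H(z) (example)
d_bar = np.abs(d)                         # coefficients of Hbar(z): |h(k)| <= hbar(k)

rng = np.random.default_rng(0)
xi_bar = 0.05 * np.ones(50)               # bound on the virtual measurement error (43)
xi = 0.05 * (2 * rng.random(50) - 1)      # any error respecting |xi(k)| <= xibar(k)

Xi = lfilter(d, [1.0], xi)                # filtered virtual measurement error
Xi_bar = lfilter(d_bar, [1.0], xi_bar)    # its bound, Eq. (46)

assert np.all(np.abs(Xi) <= Xi_bar + 1e-12)
print("The filtered error never exceeds its bound, as guaranteed by (46).")
```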

4.3 Fault Detection Estimation and Residual Generation

In this subsection, we present a method for computing the local state estimate \(\tilde{x}_I\) for fault detection purposes. The local estimation \(\tilde{x}_I^{(i)}\) is given by

$$\begin{aligned}&\tilde{x}_{I}^{(i)}(k+1)=f_{I}^{(i)}(y_{I}(k),u_{I}(k))+g_{I}^{(i)}(y_{I}(k),v^b_{I}(k),u_{I}(k))\nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \quad +\hat{\eta }_{I}^{(i)}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(k)), \end{aligned}$$
(47)

with initial condition \(\tilde{x}_{I}^{(i)}(0) = y_I^{(i)}(0)\), where \(\hat{\eta }_{I}\) is the output of an adaptive approximator designed in Sect. 4.4 to learn the unknown modeling uncertainty function \(\eta _{I}\), \(\hat{\vartheta }_{I}\in \hat{\varTheta }_{I}\) denotes its adjustable parameters vector, and \(v^b_I\) is the most recent interconnection information available in the buffer at time k, whose virtual time stamp is denoted by \(t_b\).

The local estimation residual error \(r_I(k)\) is defined as

$$\begin{aligned} r_I(k) \triangleq Y_I(k) - \widehat{Y}_I(k), \end{aligned}$$
(48)

where we obtain the filtered output \(Y_I(k)\) by locally filtering the measurement output signal \(y_I(k)\)

$$\begin{aligned} Y_{I}(k)\triangleq H(z) \left[ y_{I}(k) \right] , \end{aligned}$$
(49)

and the output estimates as

$$\begin{aligned} \widehat{Y}_I(k)\triangleq H(z) \left[ \tilde{x}_{I}(k) \right] . \end{aligned}$$
(50)

The residual constitutes the basis of the fault detection scheme. It can be compared, component by component, to a suitable adaptive detection threshold \(\bar{r}_{I}\in \mathbb {R}^{n_I}\), thus generating a local fault decision attesting the status of the subsystem: healthy or faulty. A fault in the overall system is said to be detected when \(|r_I^{(i)}(k)| > \bar{r}_I^{(i)}(k)\), for at least one component i in any I-th LFD.

We now analyze the filtered measurements and estimates:

$$\begin{aligned} Y_{I}(k)&=H(z) \left[ y_{I}(k) \right] =H(z) \left[ x_{I}(k) + \xi _{I}(k) \right] \nonumber \\&=H_p(z) \left[ z\left[ x_I(k) \right] \right] + \varXi _I(k). \end{aligned}$$
(51)

In the absence of any faults (i.e., \(\phi _I\big (x_I(k),z_I(k),u_I(k)\big )=0\)), (51) becomes

$$\begin{aligned} Y_I(k)&= H_p(z) \big [ x_I(k+1) +z\big [x_I(0) \delta (k)\big ] \big ] + \varXi _I(k) \nonumber \\&=H_p(z) \big [ f_I\big (x_I(k),u_I(k)\big ) + g_I\big (x_I(k),z_I(k),u_I(k)\big ) \nonumber \\&\quad + \eta _I\big (x_I(k),z_I(k),u_I(k)\big )\big ] + h(k) x_I(0) + \varXi _I(k), \end{aligned}$$
(52)

where \(\delta (k)\) denotes the discrete-time unit-impulse sequence.

The filtered output estimation model for \(Y_I\), denoted by \(\widehat{Y}_I\), can be analyzed from the estimate provided by (47) as follows:

$$\begin{aligned} \widehat{Y}_{I}^{(i)}(k) =&H_p(z) \bigg [ f_I^{(i)}\big (y_I(k),u_I(k)\big ) + {g}_I^{(i)}\big (y_I(k),v_I^b(k),u_I(k)\big )\nonumber \\&+ \hat{\eta }_I^{(i)}\big (y_I(k),v_I^b(k),u_I(k),\hat{\vartheta }_I(k)\big ) \bigg ]+ h(k) y_I^{(i)}(0). \end{aligned}$$
(53)

Therefore, the residual (48) is readily computable from (49) and (50). The residual is analyzed in Sect. 4.6 to obtain a suitable adaptive detection threshold. Now, we design the adaptive approximator \(\hat{\eta }_{I}\), needed to compute the state estimate (47) and hence (50).
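The following hedged sketch assembles the estimator (47) and the filtered residual (48)-(50) for a scalar subsystem under healthy conditions; the model functions f_I and g_I, the stubbed approximator eta_hat, the noise level and the FIR choice of H(z) are illustrative placeholders and do not correspond to the chapter's example.

```python
# A hedged sketch of the fault detection estimator (47) and the filtered
# residual (48)-(50) for a scalar subsystem in healthy operation. All model
# functions, signals and the filter below are placeholders.
import numpy as np
from scipy.signal import lfilter

def f_I(x, u):         return 0.8 * x + u      # hypothetical nominal local dynamics
def g_I(x, v, u):      return 0.05 * v         # hypothetical interconnection term
def eta_hat(x, v, u):  return 0.0              # adaptive approximator output (stub)

K = 200
u = 0.1 * np.ones(K)                           # known local input
v_b = np.zeros(K)                              # buffered interconnection measurements
x = np.zeros(K); y = np.zeros(K); x_tilde = np.zeros(K)

rng = np.random.default_rng(0)
x_tilde[0] = y[0]                              # initial condition of (47)
for k in range(K - 1):
    x[k + 1] = f_I(x[k], u[k]) + g_I(x[k], 0.0, u[k])        # true healthy subsystem
    y[k + 1] = x[k + 1] + 0.01 * rng.standard_normal()       # virtual measurement
    # Fault Detection Approximation Estimator, Eq. (47)
    x_tilde[k + 1] = f_I(y[k], u[k]) + g_I(y[k], v_b[k], u[k]) + eta_hat(y[k], v_b[k], u[k])

d = np.full(4, 0.25)                           # FIR filter H(z): 4-tap moving average
Y     = lfilter(d, [1.0], y)                   # filtered measurements, Eq. (49)
Y_hat = lfilter(d, [1.0], x_tilde)             # filtered estimates, Eq. (50)
r     = Y - Y_hat                              # residual, Eq. (48), compared with (66)
print("max |r(k)| under healthy operation:", np.abs(r[10:]).max())
```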

4.4 Learning of the Modeling Uncertainty

Reducing the modeling uncertainty enables tighter detection thresholds which, in turn, result in better detection capabilities. In this subsection, we consider the design of a nonlinear adaptive approximator, exploiting the variables available in the local buffers of each LFD to manage communication delays (the details of the delay compensation strategy are given in Sect. 4.5). The structure of the linear-in-the-parameters nonlinear multivariable approximator is not dealt with in this chapter (nonlinear approximation schemes such as neural networks, fuzzy logic networks, wavelet networks, spline functions, polynomials, etc., can be used).

As shown later on in this subsection, adaptation of the parameters \(\hat{\vartheta }_{I}\) of the approximator is achieved through the design of a dynamic state estimator which takes on the form:

$$\begin{aligned} \hat{x}_{I}^{(i)}(k+1)=\lambda (\hat{x}_{I}^{(i)}(k)-y_I^{(i)}(k)) +f_{I}^{(i)}(y_{I},u_{I})+g_{I}^{(i)}(y_{I},v^b_{I},u_{I}) +\hat{\eta }_{I}^{(i)}(y_{I},v^b_{I},u_{I},\hat{\vartheta }_{I}), \end{aligned}$$
(54)

where \(0<\lambda <1\) is a design parameter. Let us introduce the estimation error

$$ \varepsilon _{I}(k)\triangleq y_{I}(k)-\hat{x}_{I}(k) $$

We compute the i-th state estimation error component as follows:

$$\begin{aligned} \begin{aligned} \varepsilon ^{({i})}_{I}(k+1)&=y_{I}^{({i})}(k+1)-\hat{x}_{I}^{(i)}(k+1) \\&=\lambda \varepsilon _{I}^{(i)}(k)+\varDelta f^{(i)}_{I}+\varDelta g^{(i)}_{I}+\varDelta \eta ^{(i)}_{I}-\lambda \xi ^{(i)}_{I}(k) +\lambda \xi ^{(i)}_{I}(k)+\xi ^{(i)}_{I}(k+1) \\&=\lambda \varepsilon _{I}^{(i)}(k)+\varDelta f^{(i)}_{I}+\varDelta g^{(i)}_{I}+\varDelta \eta ^{(i)}_{I}+\xi ^{(i)}_{I}(k+1) \, , \end{aligned} \end{aligned}$$
(55)

where

$$ \varDelta f^{(i)}_{I}\triangleq f_{I}^{(i)}(x_{I},u_{I})-f_{I}^{(i)}(y_{I},u_{I})\, , $$
$$ \varDelta g^{(i)}_{I} \triangleq g_{I}^{(i)}(x_{I},z_{I},u_{I})-g_{I}^{(i)}(y_{I},v^b_{I},u_{I}) \, , $$

and

$$ \varDelta \eta ^{(i)}_{I} \triangleq \eta _{I}^{(i)}(x_{I},z_{I},u_{I})-\hat{\eta }_{I}^{(i)}(y_{I},v^b_{I},u_{I},\hat{\vartheta }_{I}) \, . $$

From this equation, the following learning law can be derived using Lyapunov stability techniques (see [107]) for every I:

$$\begin{aligned} \hat{\vartheta }_{I}(k+1)=P_{\hat{\varTheta }_{I}}\!\left[ \hat{\vartheta }_{I}(k)+\gamma _{I}L^{\top }_{I}\big (\varepsilon _{I}(k+1)-\lambda \varepsilon _I(k)\big )\right] \, , \end{aligned}$$
(56)

where \(L^{\top }_{I}=\partial \hat{\eta }_{I}/\partial \hat{\vartheta }_{I}\) is the gradient matrix of the online approximator with respect to its adjustable parameters, \(\gamma _{I}={\mu _{I}}/\big ({\rho _{I}+\left\| L^{\top }_{I}\right\| ^{2}_{F}}\big )\) is the learning gain, \(P_{\hat{\varTheta }_{I}}\) is a projection operator restricting \(\hat{\vartheta }_{I}\) within \(\hat{\varTheta }_{I}\) [68], \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm, and \(\rho _{I}>0\), \(0<\mu _{I}<2\) are design constants that guarantee the stability of the learning law [68].
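A minimal sketch of one step of the learning law (56) is given below; the regressor, the box-shaped parameter set used for the projection, and the numerical values of the design constants are assumptions introduced only to make the update concrete.

```python
# A minimal sketch of the parameter update (56) for a linear-in-the-parameters
# approximator eta_hat(y, v, u, theta) = L(y, v, u) @ theta. The regressor, the
# parameter box and the design constants are placeholder choices.
import numpy as np

rho, mu, lam = 0.1, 1.0, 0.5          # design constants: rho>0, 0<mu<2, 0<lambda<1
theta_min, theta_max = -2.0, 2.0      # parameter set Theta_I (a box, for projection)

def regressor(y, v, u):
    """Hypothetical gradient matrix L_I^T = d(eta_hat)/d(theta): here a simple
    basis built from the measured quantities (n_I = 1 state, 3 parameters)."""
    return np.array([[y, v, u]])      # shape (n_I, n_theta)

def update_theta(theta, y, v, u, eps_next, eps_now):
    L = regressor(y, v, u)                              # L_I^T
    gamma = mu / (rho + np.linalg.norm(L, 'fro')**2)    # learning gain gamma_I
    theta_new = theta + gamma * L.T @ (eps_next - lam * eps_now)
    return np.clip(theta_new, theta_min, theta_max)     # projection P_Theta (box case)

theta = np.zeros(3)
eps_now, eps_next = np.array([0.02]), np.array([0.015])
theta = update_theta(theta, y=0.4, v=0.1, u=0.2, eps_next=eps_next, eps_now=eps_now)
print("updated parameters:", theta)
```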

4.5 Delay Compensation Strategy

Next, we analyze the properties of the Fault Detection estimator introduced in Sect. 4.3, where the filtered measurements are used; in particular, we explain how the estimator manages delays and packet losses in the second-level communication network between diagnosers.

In order to compute (47) and (54), the generic J-th diagnoser communicates the current values of the variables \(v_I\) to the neighboring LFDs. It is worth noting that this information exchange between diagnosers can be affected by time-varying delays and packet losses and hence a compensation strategy has to be devised. The delay compensation strategy is derived without any assumption on the delay length, thus also handling packet losses and “out-of-sequence” packets. We assume that the communication network between diagnosers is designed so as to avoid pathological scenarios, such as, for example, a situation in which the communication delay is always larger than the sampling time. It is important to note that a re-synchronization strategy like the one used for the first level communication network cannot be used in this case, since here we consider data exchanged between different LFDs, and each LFD, of course, does not know the models of the neighboring subsystems.

As in [12], thanks to the use of the virtual Time Stamps, only the most recent measurements and information are considered. When a data packet arrives, its virtual Time Stamp \(v_\mathrm{TS}\) is compared to \(t_b\), the virtual Time Stamp of the information already in the buffer. If \(v_\mathrm{TS} > t_b\), then the new data packet replaces the one in the buffer and \(t_b\leftarrow v_\mathrm{TS}\). At time \(t_c\), with \(k<t_c<k+1\), each LFD computes the estimates for the time instant \(k+1\) using information referred to time k. A variable in the buffer is up to date if \(t_b=k\). Should a delay or a packet loss occur in the second level communication network, we proceed as follows. If some of the interconnection variables are not up to date, that is \(t_b<k\), then the learning of the modeling uncertainty function \(\eta _I\) via (56) is temporarily paused. The out-of-date interconnection variables are nevertheless used to compute the local value of the interconnection function in the state estimators (47) and (54), and the resulting error is taken into account in the computation of the detection threshold, as will be seen in the following subsection. A sketch of this buffer logic is given below.
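The sketch below is a plain illustration of the described buffer handling with virtual Time Stamps; the class, its fields, and the numerical values are hypothetical and stand in for whatever data structures an actual LFD implementation would use.

```python
# A sketch of the buffer logic of Sect. 4.5, using virtual Time Stamps to keep
# only the most recent neighbor information and to decide whether the learning
# law (56) should be paused. The data structures are illustrative only.
class NeighborBuffer:
    def __init__(self):
        self.t_b = -1          # vTS of the information currently in the buffer
        self.value = None      # most recent interconnection measurement v_I^b

    def receive(self, v_ts, value):
        """Called whenever a packet arrives (possibly delayed or out of sequence)."""
        if v_ts > self.t_b:    # keep only the most recent information
            self.t_b, self.value = v_ts, value
        # older or duplicated packets are simply discarded

    def up_to_date(self, k):
        return self.t_b == k

buf = NeighborBuffer()
buf.receive(v_ts=7, value=0.31)
buf.receive(v_ts=5, value=0.28)     # out-of-sequence packet: discarded
k = 8
pause_learning = not buf.up_to_date(k)
# At time t_c (k < t_c < k+1), the LFD computes the estimate for k+1 with the
# buffered value; if the buffer is not up to date, the adaptation (56) is
# paused and the extra uncertainty enters the threshold through (65).
print("use v_b =", buf.value, "| pause learning:", pause_learning)
```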

4.6 Detection Threshold

In order to define an appropriate threshold for the detection of faults, we now analyze the dynamics of the output estimation error when the system is under healthy mode of behavior. Since, from (52) we have

$$\begin{aligned} \begin{aligned} Y_{I}^{(i)}(k) = H_p(z) \big [ f_I^{(i)}&\big (x_I(k),u_I(k)\big ) + g_I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big ) \\&+ \eta _I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big )\big ] + h(k) x_I^{(i)}(0) + \varXi _I^{(i)}(k), \end{aligned} \end{aligned}$$
(57)

we are able to compute the residual defined in (48) by using (53) and (57):

$$\begin{aligned} r_I^{(i)}(k) = \left[ \chi _I^{(i)}(k) \right] ^b - \xi _I^{(i)}(0)h(k) + \varXi _I^{(i)}(k) \, , \end{aligned}$$
(58)

where the total uncertainty term \(\chi _I^{(i)}(k)\) is defined as

$$\begin{aligned} \chi _I^{(i)}(k) \triangleq H_p(z) \big [ \varDelta f_I^{(i)}(k) + \varDelta g_I^{(i)}(k)+\varDelta \eta _I^{(i)}(k) \big ]. \end{aligned}$$
(59)

The function error \(\varDelta \eta _I\) can be computed as the sum of four different terms:

$$\begin{aligned} \varDelta \eta _I=L_I\tilde{\vartheta }_I+\upsilon _I+\varDelta \hat{\eta }_I+\varDelta \eta _I^{\tau } \, . \end{aligned}$$
(60)

The first term takes into account the error due to the parameters’ estimation. This error can be characterized by introducing an optimal weight vector [98] \(\hat{\vartheta }^{*}_I\) as follows:

$$\begin{aligned} \hat{\vartheta }^{*}_I \triangleq \arg \min _{\hat{\vartheta }_I}\sup _{x_I,z_I,u_I}\left\| \eta _I(x_I,z_I,u_I)- \hat{\eta }_I(x_I,z_I,u_I,\hat{\vartheta }_I)\right\| , \end{aligned}$$
(61)

with \(\hat{\vartheta }_I,x_I, z_I, u_I\) taking values in their respective domains, and by defining the parameter estimation error

$$ \tilde{\vartheta }_I \triangleq \hat{\vartheta }_I^{*}-\hat{\vartheta }_I \, . $$

The second term in (60) is the so-called Minimum Functional Approximation Error \(\upsilon _I\), which describes the least possible approximation error that can be obtained at time k if \(\hat{\vartheta }_I\) were optimally chosen:

$$ \upsilon _I(k) \triangleq \eta _I(x_I,z_I,u_I)-\hat{\eta }_I(x_I,z_I,u_I,\hat{\vartheta }_I^{*}) \, . $$

Then, a term representing the error caused by the use of the uncertain measurements instead of the actual values of the state variables is defined:

$$ \varDelta \hat{\eta }_I \triangleq \hat{\eta }_I(x_I,z_I,u_I,\hat{\vartheta }_I)-\hat{\eta }_I(y_I,v_I,u_I,\hat{\vartheta }_I) \, . $$

Finally, the estimation error due to the use of delayed measurements is taken into account by

$$ \varDelta \eta _I^{\tau }\triangleq \hat{\eta }_I(y_I,v_I,u_I,\hat{\vartheta }_I)-\hat{\eta }_I(y_I,v^b_I,u_I,\hat{\vartheta }_I) \, $$

where \(v_I\) is the current measured variable and \(v^b_I\) is the value in the buffer, which is “old” in the presence of delays. Clearly, \(\varDelta \eta _I^{\tau } = 0\) when up to date measurements are used (in this case, \(v^b_I=v_I\)).

Using (60), the total uncertainty term \(\chi _I^{(i)}(k)\) in (59) can be rewritten as

$$\begin{aligned} \begin{aligned} \chi _I^{(i)}(k) \triangleq H_p(z) \big [ \varDelta f_I^{(i)}(k) +\varDelta g_I^{(i)}(k)+ L_I^{(i)}\tilde{\vartheta }_I(k)&+ \upsilon _I^{(i)}(k)\\&+ \varDelta \hat{\eta }_I^{(i)}(k)+\varDelta \eta _I^{\tau (i)}(k) \big ], \end{aligned} \end{aligned}$$
(62)

where \(L_I^{(i)}\) indicates the i-th row of the matrix \(L_I\). Using the triangle inequality, (58) satisfies:

$$\begin{aligned} \left| r_I^{(i)}(k) \right|&\, \le \left| \bigg [ \chi _I^{(i)}(k) \bigg ]^b \right| +\left| \xi _I^{(i)}(0)h(k) \right| + \left| \varXi _I^{(i)}(k)\right| \nonumber \\&\, \le \bigg [ \left| \chi _I^{(i)}(k)\right| \bigg ]^b +\bar{\xi }_I^{(i)}(0) \left| h(k)\right| + \bar{\varXi }_I^{(i)}(k). \end{aligned}$$
(63)

From (62) and using again the triangle inequality, we can obtain

$$\begin{aligned}&\left| \chi _I^{(i)}(k)\right| \le \left| H_p(z) \big [\varDelta f_I^{(i)}(k) + \varDelta g_I^{(i)}(k) + \varDelta \eta _I^{(i)}(k)\big ]\right| \nonumber \\&\quad \le \sum _{n=0}^{k} \left| h_p(k-n)\right| \left| \varDelta f_I^{(i)}(n) +\varDelta g_I^{(i)}(n) + L_I^{(i)}\tilde{\vartheta }_I(n) +\upsilon _I^{(i)}(n)\right. \nonumber \\&\left. \qquad +\varDelta \hat{\eta }_I^{(i)}(n)+\varDelta \eta _I^{\tau (i)}(n)\right| \nonumber \\&\quad \le \bar{\chi }_I^{(i)}(k)\triangleq \bar{H}_p(z) \big [ \bar{\varDelta }f_I^{(i)} (k)+ \bar{\varDelta }g_I^{(i)}(k)+ \bar{\varDelta }\eta _I^{(i)}(k) \big ], \end{aligned}$$
(64)

where \(\bar{H}_p(z)\) is the transfer function with impulse response that satisfies \(\left| h_p(k)\right| \le \bar{h}_p(k)\) (more details for the selection of \(\bar{H}_p(z)\) are given in Sect. 4.7),

$$ \bar{\varDelta }f_I^{(i)}(k) \triangleq \max _{\left| \xi _I\right| \le \bar{\xi }_I}\left\{ \left| \varDelta f_I^{(i)}(k)\right| \right\} ,$$
$$ \bar{\varDelta }g_I^{(i)}(k) \triangleq \max _{\left| \xi _{I}\right| \le \bar{\xi }_I(k)}\max _{\left| \varsigma _I\right| \le \bar{\varsigma }_I(k)}\left\{ \left| \varDelta g_I^{(i)}(k)\right| \right\} $$

and

$$\begin{aligned} \begin{aligned} \bar{\varDelta }\eta _I^{(i)}(k) \triangleq \left\| L_I^{(i)}\right\| \kappa _I(\hat{\vartheta }_I)&+\bar{\upsilon }_I^{(i)}(k)+\max _{\left| \xi _{I}\right| \le \bar{\xi }_I(k)}\max _{\left| \varsigma _I\right| \le \bar{\varsigma }_I(k)}\left| \varDelta \hat{\eta }_I^{(i)}(k)\right| \\&+\max _{v_I\in \mathscr {R}^{v_I}}\left| \hat{\eta }_I^{(i)}(y_I,v_I,u_I,\hat{\vartheta }_I)-\hat{\eta }_I^{(i)}(y_I,v^b_I(t_b),u_I,\hat{\vartheta }_I)\right| , \end{aligned} \end{aligned}$$
(65)

with \(\bar{\upsilon }_I\) denoting a bound on the minimum functional approximation error, the function \(\kappa _I\) being such that \(\kappa _I(\hat{\vartheta }_I)\ge \left\| \tilde{\vartheta }_I\right\| \), and \(\mathscr {R}^{v_I}\subset \mathbb {R}^{\bar{n}_I}\) representing a local domain of the interconnection variables, communicated by the neighboring LFDs at \(k=0\). It is important to remark that \(\mathscr {R}^{v_I}\) coincides with the domain \(\mathscr {D}_{z_I}\) for subsystem I. Thanks to the way the threshold is designed from (63), it is straightforward to see that the absence of false alarms is guaranteed, since the residual prior to the fault occurrence always satisfies

$$ \left| r_I^{(i)}(k) \right| \le \bar{ r}_I^{(i)}(k) \, , $$

where the detection threshold \(\bar{ r}_I^{(i)}\) is defined as

$$\begin{aligned}&\bar{ r}_I^{(i)}(k) \triangleq \bigg [ \bar{\chi }_I^{(i)}(k) \bigg ]^b +\bar{\xi }_I^{(i)}(0) \left| h(k)\right| + \bar{\varXi }_I^{(i)}(k). \end{aligned}$$
(66)

Remark 5

Notice that, even in the case of a conservative bound \(\bar{\xi }_I^{(i)}(0)\), the second term \(\bar{\xi }_I^{(i)}(0) \left| h(k)\right| \) affects the detection threshold only during the initial portion of the transient (the impulse response h(k) of the filter H(z) decays exponentially). Moreover, the last term in (65) takes into account the uncertainty due to the delays in the communication network between LFDs. This term is instrumental in ensuring the absence of false alarms caused by these communication delays.

Remark 6

The terms \(\bar{\xi }_I(k)\) and \(\bar{\varsigma }_I(k)\) are computed by the LFDs at each time step after the re-synchronization task (see (43)) and are available to compute the fault detection threshold.

Remark 7

Admittedly, the bounds used in (64) and (65) give rise to conservative thresholds, but they have the advantage of guaranteeing the absence of false-positive alarms and of being easily computable, requiring only a small amount of data to be exchanged between the LFDs. In the presence of a priori knowledge about the process to be monitored, tighter bounds could be devised. For example, Lipschitz conditions on the local models could easily be exploited to derive tighter detection thresholds.

4.7 Selection of Filter \(\bar{H}_p(z)\)

A practical issue that requires consideration is the selection of the filter \(\bar{H}_p(z)\), whose impulse response must satisfy \(| h_p(k) | \le \bar{h}_p(k) \) as stated before. In the case where the impulse response \(h_p(k)\) is nonnegative, the selection \(\bar{H}_p(z)=H_p(z)\) is trivial. Sufficient conditions for a nonnegative impulse response for a class of discrete-time transfer functions are given in [60]. In the following, we present two methods for choosing \(\bar{H}_p(z)\), one considering H(z) as a digital IIR filter and the other one as a FIR filter.

First, we consider the case where H(z) is an IIR filter. Due to the way \(H_p(z)\) was defined, \(H_p(z)\) is strictly proper and asymptotically stable. Hence, the impulse response \(h_p(k)\) satisfies \(| h_p(k) | \le \kappa \lambda ^k\) for all \(k \in \mathbb {N}\), for some \(\kappa >0\) and \(\lambda \in [0,1)\). Since \(| h_p(k) | \le \bar{h}_p(k) \) must hold, the impulse response \(\bar{h}_p(k)\) can be selected as \(\bar{h}_p(k)=\kappa \lambda ^k\) and thus \(\bar{H}_p(z) = \frac{\kappa }{1-\lambda z^{-1}}\).
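A hedged numerical sketch of this IIR construction is given below: \(\lambda\) is taken slightly larger than the largest pole magnitude of \(H_p(z)\) and \(\kappa\) is estimated over a long finite horizon of the impulse response. The coefficients are example values and the finite-horizon estimate of \(\kappa\) is a practical approximation rather than an analytical derivation.

```python
# A hedged numerical sketch of the IIR case: estimate kappa and lambda so that
# |h_p(k)| <= kappa * lambda^k, then use Hbar_p(z) = kappa / (1 - lambda z^-1).
# Coefficients are example values; kappa is estimated over a finite horizon.
import numpy as np
from scipy.signal import dimpulse

b = [0.3, 0.2]                   # numerator coefficients of H(z) (example)
a = [1.0, -0.9, 0.2]             # denominator of H(z); poles at z = 0.4 and z = 0.5
b_p = [0.0] + b                  # H_p(z) = z^-1 H(z): delay the numerator by one sample

lam = 1.02 * np.max(np.abs(np.roots(a)))    # any lambda with max|pole| < lambda < 1

# kappa: smallest constant such that |h_p(k)| <= kappa * lam^k on a long horizon
_, (h_p,) = dimpulse((b_p, a, 1), n=400)
h_p = np.asarray(h_p).ravel()
k = np.arange(len(h_p))
kappa = float(np.max(np.abs(h_p) / lam**k))

print(f"Hbar_p(z) = {kappa:.3f} / (1 - {lam:.3f} z^-1)")
```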

Now, let us consider the case in which H(z) is a FIR filter. FIR filters have several advantages: they are inherently stable and can easily be designed to be linear-phase, which corresponds to a uniform delay at all frequencies. Let H(z) be a p-th order FIR filter given by \(H(z)=\sum _{n=0}^p{d_n z^{-n}}\). Therefore, \(H_p(z)=z^{-1}H(z)=\sum _{n=0}^p{d_n z^{-(n+1)}}\) and \(\bar{h}_p(k)\) can be selected as \(\bar{h}_p(k)=| h_p(k) |\), which leads to the FIR filter \(\bar{H}_p(z)=\sum _{n=0}^p{| d_n | z^{-(n+1)}}\).

4.8 The Local Fault Detection Algorithm

Now, all the elements needed to implement the fault detection scheme are available. For the sake of clarity, the implementation of the local fault detection methodology is sketched in the following Algorithm 1. Extensive simulation results showing the effectiveness of the presented approach can be found in [14].

Algorithm 1  The local fault detection algorithm
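To complement Algorithm 1, a minimal high-level sketch of one LFD iteration is given below. Every function body is a placeholder stub standing for the corresponding quantity in Sects. 4.1-4.6 (re-synchronization, estimation, filtering, threshold), so the sketch illustrates the flow of the algorithm rather than reproducing the actual implementation.

```python
# A hedged, high-level sketch of one iteration of an LFD, assembling the steps
# of Sects. 4.1-4.6. All function bodies are trivial stubs used only to make
# the control flow concrete; they do not reproduce the chapter's algorithm.
import numpy as np

def resynchronize(measurement_buffer, k):
    """Project buffered time-stamped measurements to instant k, Eq. (41)."""
    return np.array([m for (_, m) in measurement_buffer])          # stub: y_I(k)

def state_estimate(y, v_b, u, theta):
    """Fault detection estimator (47), with the adaptive approximator eta_hat."""
    return 0.9 * y + 0.05 * v_b + 0.0 * u                          # stub: x_tilde

def filter_signal(history):
    """Apply H(z) to a signal history, Eqs. (49)-(50)."""
    return np.mean(history[-4:], axis=0)                           # stub: FIR example

def detection_threshold(k):
    """Adaptive threshold rbar_I(k), Eq. (66)."""
    return 0.1 * np.ones(2)                                        # stub

def lfd_step(k, measurement_buffer, v_b, u, theta, y_hist, xt_hist):
    y = resynchronize(measurement_buffer, k)        # 1. re-synchronize local sensors
    y_hist.append(y)
    xt_hist.append(state_estimate(y, v_b, u, theta))# 2. update the FDAE estimate (47)
    Y, Y_hat = filter_signal(y_hist), filter_signal(xt_hist)       # 3. filtering
    r = Y - Y_hat                                   # 4. residual (48)
    fault_detected = np.any(np.abs(r) > detection_threshold(k))    # 5. decision
    return fault_detected, y

# One illustrative step for a two-state subsystem
buf = [(3.7, 0.4), (3.9, -0.1)]                     # (time stamp, measurement) pairs
detected, _ = lfd_step(k=4, measurement_buffer=buf, v_b=0.2, u=0.0,
                       theta=None, y_hist=[], xt_hist=[])
print("fault detected:", detected)
```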

4.9 Detectability Conditions

In this subsection, we address some sufficient conditions for detectability of faults by the proposed distributed networked fault detection scheme, thus considering the behavior of the fault detection algorithm in the case of a faulty system. We assume that at an unknown time \(k_0\) a fault \(\phi _I\) occurs. The fault detectability analysis constitutes a theoretical result that characterizes quantitatively (and implicitly) the class of faults detectable by the proposed scheme.

Theorem 4

(Fault Detectability) A fault in the I-th subsystem occurring at time \(k=k_0\) is detectable at a certain time \(k=k_d\) if the fault function \(\phi _I^{(i)}(x_I(k),z_I(k),u_I(k))\) satisfies the following inequality for some \(i=1,\ldots ,n_I\):

$$\begin{aligned} \left| \sum _{n=k_0}^{k_d} h_p(k_d-n) \phi _I^{(i)}\big (x_I(n),z_I(n),u_I(n)\big ) \right| > 2 \bar{r}_I^{(i)}(k_d). \end{aligned}$$
(67)

Proof

After fault occurrence, that is for \(k>k_0\), Eq. (58) becomes

$$\begin{aligned} \begin{aligned}&r_I^{(i)}(k) = \chi _I^{(i)}(k)^b +H_p(z) \big [ \phi _I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big ) \big ] - \xi _I^{(i)}(0)h(k) + \varXi _I^{(i)}(k) \\&= \chi _I^{(i)}(k) ^b - \xi _I^{(i)}(0)h(k) + \varXi _I^{(i)}(k) + H_p(z) \big [ \phi _I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big ) \big ]. \end{aligned} \end{aligned}$$
(68)

Using the triangle inequality, from (68) we can write

$$\begin{aligned} \begin{aligned} \left| r_I^{(i)}(k)\right| \ge&- \left| \chi _I^{(i)}(k)^b \right| - \left| \xi _I^{(i)}(0)h(k)\right| - \left| \varXi _I^{(i)}(k)\right| \\&+ \left| H_p(z) \big [ \phi _I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big ) \big ] \right| \end{aligned} \end{aligned}$$
(69)

and by using a similar procedure as in the derivation of (66), (69) becomes

$$\begin{aligned} \left| r_I^{(i)}(k)\right|&\ge - \bar{r}_I^{(i)}(k) + \left| H_p(z) \big [ \phi _I^{(i)}\big (x_I(k),z_I(k),u_I(k)\big ) \big ] \right| . \end{aligned}$$
(70)

For fault detection at time \(k=k_d\), the inequality \(| r_I^{(i)}(k_d) | > \bar{r}_I^{(i)}(k_d)\) must hold for some \(i=1,\ldots ,n_I\), so the final fault detectability condition is obtained:

$$\begin{aligned} \left| H_p(z) \big [ \phi _I^{(i)}(x_I(k_d),z_I(k_d),u_I(k_d)) \big ] \right| > 2 \bar{r}_I^{(i)}(k_d). \end{aligned}$$

This can be rewritten in the summation form (67) of the Theorem.    \(\square \)

This theorem provides a sufficient condition that implicitly characterizes the class of faults detectable by the proposed fault detection scheme. Let us note that the detectability condition represents the minimum cumulative magnitude of the fault that can be detected under a specific trajectory of the system. It is possible to study this condition off line for representative trajectories of the system, as sketched below.
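The off-line study mentioned above can be carried out, for a representative trajectory and a candidate fault profile, by evaluating the left-hand side of (67) against twice the threshold; the impulse response of \(H_p(z)\), the fault profile and the threshold value below are illustrative placeholders.

```python
# A sketch of the off-line detectability check: for a candidate fault profile,
# evaluate the left-hand side of (67) against 2*rbar and find the first time
# instant at which the condition holds. All values are illustrative.
import numpy as np
from scipy.signal import lfilter

K, k0 = 300, 100
h_p = np.array([0.0, 0.25, 0.25, 0.25, 0.25])      # impulse response of H_p(z) (example FIR)

phi = np.zeros(K)
phi[k0:] = 0.4 * (1.0 - np.exp(-0.05 * np.arange(K - k0)))   # incipient fault (example)

r_bar = 0.05 * np.ones(K)                          # detection threshold (66), assumed constant

filtered_fault = lfilter(h_p, [1.0], phi)          # left-hand side of (67)
detectable = np.abs(filtered_fault) > 2.0 * r_bar
k_d = int(np.argmax(detectable)) if detectable.any() else None
print("first time instant at which (67) holds:", k_d)
```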

4.10 Identification of the Faulty Subsystem

In the next section, we consider the fault diagnosis problem. More specifically, we illustrate an approach for the adaptive learning of the local fault function after fault detection. Before developing the adaptive approximation procedure, we present an important remark.

A fundamental question regarding fault detectability is whether the fault that occurs in subsystem \(\varSigma _J\) is detectable not only by the LFD \(\mathscr {F}_J\), but also by the LFD \(\mathscr {F}_I\) of the neighboring subsystem \(\varSigma _I\), whose state is influenced by \(\varSigma _J\) dynamics.

It can be shown (the interested reader can refer to [52]) that the proposed fault detection scheme guarantees that a process fault \(\phi _J(\cdot )\) occurring in subsystem \(\varSigma _J\) which affects \(\varSigma _I\) can only be detected by its corresponding LFD \(\mathscr {F}_J\) and not by the LFD \(\mathscr {F}_I\). This result is essentially an implication of using the measurements of the state and interconnection variables in the estimation model given by (11). Qualitatively, this can be explained as follows. When a process fault occurs in \(\varSigma _J\), the fault affects its states, which in turn affect other subsystems through the interconnection variables. So, the states of \(\varSigma _J\) are “contaminated” by the process fault and the measurements of these states also contain the process fault effects. Therefore, a subsystem \(\varSigma _I\) that is affected by \(\varSigma _J\) is affected by the process fault that occurred in \(\varSigma _J\) through the interconnection variables \(z_I\), and the LFD \(\mathscr {F}_I\) makes use of the measurements \(v_I\), which are also “contaminated” by the same fault. Hence, the effect of the process fault that occurred in \(\varSigma _J\) is “canceled out” in the LFD \(\mathscr {F}_I\), which is unable to detect it. As a consequence, a process fault occurring in subsystem \(\varSigma _J\) is detectable only by its respective LFD \(\mathscr {F}_J\) and not by any other LFD \(\mathscr {F}_I\). This is a very important result because, when a fault is detected in a subsystem, the faulty subsystem is identified at the same time, and further fault isolation/identification methods can be targeted only at the particular faulty subsystem.

5 Fault Diagnosis - Learning the Fault Function

After a fault is detected by the LFD \(\mathscr {F}_I\) at time \(T_d\), the fault isolation task is initiated to identify the type of fault occurring in the faulty subsystem \(\varSigma _I\). In order to do this, various approaches can be used, and two of them are discussed in the sequel.

5.1 Generalized Observer Scheme

A fault isolation logic can be implemented based on a Generalized Observer Scheme (GOS, see [33, 65]). As in [31], it is assumed that each subsystem knows a local fault set \({\mathscr {O}}_{I}\), collecting all the \(N_{{\mathscr {O}}_{I}}\) possible fault functions \(\phi ^l_{I}(x_I, z_I,u_I)\), \(l\in \{1,\,\dots ,\, N_{{\mathscr {O}}_{I}}\}\). Once a fault is detected at time \(T_{d}\) in the I-th subsystem, the respective LFD \(\mathscr {F}_I\) activates \(N_{{\mathscr {O}}_{I}}\) estimators, each sensitive to a specific fault: the generic l-th fault isolation estimator of the I-th LFD is matched to the corresponding fault function \(\phi _{I}^l\) belonging to the local fault set \({\mathscr {O}}_{I}\). Each l-th estimator provides a local state estimate \({\hat{x}^l}_{I}\) of the local state \({x}_{I}\) affected by the l-th fault:

$$\begin{aligned} \begin{aligned} \hat{x}_{I}^{l(i)}(k+1)=\lambda (\hat{x}_{I}^{l(i)}(k)-y_I^{(i)}&(k)) +f_{I}^{(i)}(y_{I},u_{I})+g_{I}^{(i)}(y_{I},v^b_{I},u_{I}) \\&+\hat{\eta }_{I}^{(i)}(y_{I},v^b_{I},u_{I},\hat{\vartheta }_{I}(T_d))+\phi _{I}^{l(i)}(y_{I},v^b_{I},u_{I}), \end{aligned} \end{aligned}$$
(71)

where the learning of the modeling uncertainty has been stopped at time \(T_d\) in order not to learn the fault effect. The difference between the estimate \({\hat{x}^l}_{I}\) and the re-synchronized measurements \(y_{I}\), after filtering, constitutes the fault isolation estimation residual \({r^{\ l}_I}\triangleq Y_{I}-{\hat{Y}^l}_{I}\), where \({\hat{Y}^l}_{I} \triangleq H(z)[\hat{x}_{I}^{l}(k)]\). This residual is compared, component by component, to some properly designed isolation thresholds \({\bar{r}^{\ l}}_{I}\), so that if the j-th fault (in the fault set \({\mathscr {O}}_I\)) has occurred, then it is guaranteed that

$$\begin{aligned} | r^{\ j(i)}_I (k) |\le {\bar{r}^{\ j(i)}}_{I}(k) \quad \forall \, k>T_d, i=1,\ldots ,n_I.\end{aligned}$$
(72)

The isolation thresholds are defined similarly to the detection threshold in (66), by modifying \(\bar{\chi }_I^{(i)}(k)\) with the addition of the following term:

$$ \bar{\varDelta }\phi _I^{l(i)}(k) \triangleq \max _{\left| \xi _{I}\right| \le \bar{\xi }_I(k)}\max _{\left| \varsigma _I\right| \le \bar{\varsigma }_I(k)}\left\{ \left| \varDelta \phi _I^{l(i)}(k)\right| \right\} ,$$

where \(\varDelta \phi _I^{l(i)}(k)=\phi _{I}^{l(i)}(x_{I},z_{I},u_{I})-\phi _{I}^{l(i)}(y_{I},v^b_{I},u_{I})\).

If a residual crosses its corresponding threshold, then we can exclude the occurrence of the considered l-th fault. Therefore, if we are able to exclude all the faults but one, then we can say that the fault is isolated.

5.2 Learning the Fault Function

In the case where the fault functions are not known a priori, we can use a different approach based on the adaptive learning of the fault function. According to the approximation model (54) introduced in Sect. 4.4 for learning the modeling uncertainty, when a fault is detected in the I-th subsystem the approximator starts to learn the combined effect of the modeling uncertainty and the fault function. Assuming that the detection time \(T_d\) is sufficiently large, so that the modeling uncertainty has already been learned, its estimate is given by \(\hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(T_d))\). By allowing a sufficiently long learning period \(T_L\) after the fault detection, the approximator \(\hat{\eta }_{I}\) learns the combined effect of the modeling uncertainty and the fault function as \(\hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(T_d+T_L))\) for \(k>T_d+T_L\). Therefore, the estimated fault function is given by \(\hat{\phi }_I(k) =\hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(T_d+T_L)) - \hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(T_d))\), \(k>T_d+T_L\). Note that the fault could be incipient and still be developing at the end of the learning period, so the designer may let the learning process continue. In this case, the estimated fault function is given by \(\hat{\phi }_I(k) =\hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(k)) - \hat{\eta }_{I}(y_{I}(k),v^b_{I}(k),u_{I}(k),\hat{\vartheta }_{I}(T_d))\), \(k>T_d+T_L\). The estimated fault function can then be used for fault accommodation purposes in order to guarantee the stability of the faulty system. For more information regarding this approach for learning the fault function, the interested reader can refer to [53].
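The fault-function reconstruction step reduces to a difference of two evaluations of the adaptive approximator with frozen parameter vectors, as in the minimal sketch below; the regressor and the numerical parameter values are purely illustrative assumptions.

```python
# A hedged sketch of the fault-function learning step: after detection at T_d
# and a learning period T_L, the estimated fault is the difference between the
# approximator evaluated with post-learning and pre-detection parameters.
# The approximator structure and the parameter values are illustrative only.
import numpy as np

def eta_hat(y, v, u, theta):
    """Linear-in-the-parameters approximator with a simple regressor (assumed)."""
    return np.dot(theta, [y, v, u])

theta_at_Td      = np.array([0.10, 0.05, 0.00])   # parameters frozen at detection time T_d
theta_after_TdTL = np.array([0.45, 0.05, 0.12])   # parameters after the learning period T_L

def phi_hat(y, v, u):
    """Estimated fault function, valid for k > T_d + T_L."""
    return eta_hat(y, v, u, theta_after_TdTL) - eta_hat(y, v, u, theta_at_Td)

print("estimated fault effect at (y, v, u) = (1.0, 0.2, 0.5):",
      phi_hat(1.0, 0.2, 0.5))
```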

6 Concluding Remarks

This chapter has reviewed a distributed fault diagnosis framework specifically designed for uncertain networked nonlinear large-scale systems concerning various sources of uncertainty, namely modeling uncertainty, measurement noise, and network-related uncertainties.

In order to deal with the presence of measurement noise, a filtering scheme has been presented, integrating a general class of filters into the design of the residual and threshold signals in a way that takes advantage of the noise suppression properties of filtering. Essentially, filtering dampens the effect of measurement noise in a certain frequency range, allowing tighter detection thresholds to be set and thus enhancing fault detectability. The main implications of the filtering scheme are rigorously investigated, providing insights into the impact of the filters' poles and the fault detection time.

The modeling uncertainties are also taken into account by means of an adaptive learning technique.

Furthermore, the chapter addressed the need for integration between the different levels composing CPSs, by proposing a comprehensive architecture where all parts of complex distributed systems are considered: the physical environment, the sensor level, the diagnoser layer and the communication networks. By adapting and incorporating the devised filtering scheme into the overall framework, a distributed fault diagnosis approach has been designed for distributed uncertain nonlinear large-scale systems to specifically address the issues emerging when considering networked diagnosis systems, such as the presence of delays and packet dropouts in the communication networks, which degrade performance and could be a source of instability, missed detections, and false alarms. Multi-rate systems, where the measurements may not be synchronous, were also considered. Under the stated assumptions, the proposed architecture guarantees the absence of false-positive alarms.

Finally, some information was provided regarding the actions that can be taken after the detection of a fault in order to isolate the potential fault by identifying its location and magnitude, or even learning the fault function. Based on this information, actions can be taken in order to alleviate the fault effects and safeguard the system operation.

Modern, complex, interconnected systems can be prone to various sources of faults due to the increased complexity or even malicious attacks which can be considered as a “type” of fault. As a result, comprehensive fault diagnosis schemes need to be devised by considering the recent technological challenges, and this chapter has reviewed an integrated methodology which represents a step in that direction.