1 Introduction

The adoption of Wireless Sensor Networks (WSNs) in various industrial, healthcare, environmental monitoring, transportation, crisis management, and military surveillance applications have tremendously grown up over the last few decades. Ease of implementation, ability to operate in harsh environments, easy troubleshooting and repair, and high levels of performance are the main reasons behind the popularity of such networks. A WSN is a distributed, self configurable, ubiquitous, and infrastructure-less network, often composed of many tiny, low-cost, battery-powered sensor nodes. Each node has sensing, data processing, and communicating capabilities. The adversarial and harsh operational environments in which WSNs are deployed makes the sensor nodes error prone and hence, unreliable. A sensor node may have faults and measurement errors due to physical defects, imperfections or hardware and/or software related glitches. The application of WSNs in various critical scenarios makes it more demanding to detect faulty sensors and let all fault-free sensors to receive these faulty events; thus making the network still operational in presence of faults. The distributed fault diagnosis is intended to draw a consensus among the fault-free sensors about the status of all other sensors in the system. It acts as a basis for designing dependable systems by isolating the faulty sensors from the network. The fault-free sensors can proceed with their computations, however, with degraded performance. In this paper, we consider the problem of distributed fault diagnosis in WSNs.

Fault diagnosis has been a focused area of research since last few decades and was first explored by Preparata et al. [1] for wired networks with point to point communication links. Since then, many variants of this model have been proposed as discussed in the literature [26]. Comparison based model; the most favorable fault diagnosis mechanism has been the key discussion in [7, 8], where the decisions about the fault status of nodes are based on the comparison outcomes of the results of the same task executed by different nodes. The distributed fault diagnosis protocols for Mobile Ad hoc Networks (MANETs) are investigated in [911]. However, due to the harsh operational environments, sensor nodes fail more frequently than the nodes in other platforms. This makes the task of fault diagnosis more challenging in WSNs.

Replication of sensors for a node is treated to be the most ancient approach for fault tolerance in WSNs [12]. However, it increases the cost of a node and eventually the cost of the network along with the increased power consumption and network complexity. Jaikaeo et al. [13] have proposed a centralized fault diagnosis algorithm, addressing the response implosion problem in sensor network diagnosis. The main objective of their approach is to reduce the traffic at central manager. Lee et al. [14] have discussed another centralized scheme that uses a central manager provided with a global view of the network to reliably execute predefined corrective and preventive management maintenance. Nevertheless, the scheme suffers from certain limitations. It is non-scalable and cannot be advantageous for larger networks; central manager is the bottleneck due to high traffic. MANNA: a management architecture for fault detection in event driven WSNs is presented by Ruiz et al. [15]. This scheme puts a manager external to the network having the global knowledge of the network to detect the faulty events. However, it suffers from the disadvantages of a centralized approach. According to Ding et al. [16], neighbor coordination is another interesting approach to detect faulty nodes in sensor networks. In their approach, a sensor is considered faulty if it deviates significantly from the median of readings of neighboring sensors. In the fault detection scheme presented by Chessa et al. [17], a fault-free initiator starts the diagnosis process by accumulating information from its neighbors and the process continues until all the faulty nodes are identified. However, authors have considered no fault types other than crash faults. A self-monitoring fault detection model has been proposed by Chihfan et al. [18]. But, the network requires some initial self-configuration, and this model is constrained with static sensors only. Chen et al. [19] have discussed a comparison based distributed diagnosis protocol for WSNs. This scheme is developed on the basis of the comparison results of own sensed data and neighbor’s data. However, the scheme suffers from high communication complexity and hence, not energy efficient. Khilar et al. [20] have presented a probabilistic approach to diagnose faulty sensors in intermittent fault environment. Nevertheless, the scheme seems to be complex in terms of diagnosis time, message exchanges, and more importantly energy consumption. For faulty sensor identification considering transient faults (recurring transient faults are called intermittent faults), a comparison based method that uses time redundancy have been discussed by Lee and Choi [21]. In their algorithm, sensor readings of neighboring sensors are compared for \(r\) consecutive rounds. If the readings of two neighboring sensors, say \(S_i\) and \(S_j\), match for minimum \(\theta _2\) rounds then the final comparison outcome, say \(c_{ij}\), is 0 (pass); otherwise \(c_{ij}\) is 1 (fail). To determine the actual status of any node \(S_i\), a majority voting is followed among \(c_{ij}\); \(\forall S_j \in N_{S_i}^t\). Node \(S_i\) is considered fault-free if \(|c_i| \ge \theta _1\), where \(|c_i|\) is the number of 0’s in \(c_{ij}\). Their approach can be used to classify the faulty nodes into permanently or intermittently faulty, based on the number of rounds a node had generated similar sensor readings as its neighbors. However, considering a fixed \(\theta _1\) is not promising and may affect the accuracy. For instance, if \(\theta _1=3\) and a fault-free node has two neighbors then the node is not detected as fault-free, even in case both the neighbors are fault-free. In contrast, an adaptive threshold for each node would be more appropriate and can improve the accuracy. Some more fault management schemes are briefed in the survey [22].

In this paper, we present an efficient Fault Diagnosis Algorithm (FDA) for static topology WSNs in presence of permanently and intermittently soft faults, along with hard faults. The rest of the paper is organized as follows. Section 2 describes the network and fault model for WSNs. The proposed FDA is presented in Sect. 3, and the analytical proofs of correctness and completeness of the algorithm are shown in Sect. 4. Simulation results are discussed in Sect. 5 to support the effectiveness of the algorithm. In Sect. 6, we give a concluding remark of the work.

2 Network Model

We consider a WSN, consisting of a finite collection (say n numbers) of sensor nodes randomly deployed across a geographical area to monitor the environment. The sensor nodes are assumed to be homogeneous and stationary, and the transmission range for all sensor nodes are the same. By homogeneous, we mean sensors have the same processing power and initial energy. Each sensor has a unique identity and they communicate via a multi-hop packet radio network. At the time of deployment the nodes are assumed to be healthy i.e. fault-free. Each sensor node has the knowledge of the identity of its 1-hop neighbors along with their neighborhood information. This is a feasible requirement in sensor networks [2325].

2.1 Fault Model

After deployment, a sensor node at any point of time can be in one of the possible binary states: faulty (0) or fault-free (1). A permanently faulty node does not change its state until it is repaired and/or replaced. Either it communicates with faulty behavior at all times, or it can’t communicate at all. Apart from permanent fault, there exist two variants of faults based on the duration for which they persist; transient and intermittent faults. A transient fault occurs and perishes suddenly without any apparent intervention, whereas the intermittent fault recurs itself irregularly. Due to the unpredictable behaviors of transient and intermittent faults, they are hard to diagnose and handle as compared to permanent faults. The proposed FDA eyes on the detection of nodes with following fault types:

  • permanent or intermittent faults in sensors

  • permanent or intermittent fault in communication unit.

Sensor nodes with permanently faulty communication unit can be detected with conventional time-out mechanism, and are to be excluded from the network. However, the nodes with malfunctioning sensors still remain associated with the network since they have the ability to relay data packets among the nodes. Therefore, our main focus is to detect the permanent faults in sensors and intermittent faults in sensors or communication units.

Let the probability of a node being faulty is \(p_f\) and that of a faulty node having fault-free communication unit is \(p_f^+\). Hence, the number of faulty nodes having faulty communication units is \(n.p_f.(1-p_f^+)\). This leaves \(n'=n-n.p_f.(1-p_f^+)\) nodes that actually take part in fault diagnosis. Out of them, however, \(n.p_f.p_f^+\) nodes are faulty, even though they act as relay nodes. Now the probability of a node being faulty after excluding the nodes that cannot communicate at all can be put as follows,

$$\begin{aligned} \hat{p}_f = \frac{n.p_f.p_f^+}{n-n.p_f.(1-p_f^+)} = \frac{p_f.p_f^+}{1-p_f.(1-p_f^+)}. \end{aligned}$$
(1)

2.2 Communication Framework

The undirected graph \(C=(S,L^t)\), where \(S\) is the set of sensor nodes and \(L^t\) denotes the set of logical links between sensors at any given time \(t\), represents the communication graph or topology of sensor network at time \(t\). Henceforth, we use the superscript \(t\) to quantify the attributes at a given time \(t\); however, they may be removed from the notations when there is no explicit requirement of notion of time. Let (\(S_{(i,x)}^t,S_{(i,y)}^t\)) represent the Cartesian coordinates of the node \(S_i\). Two sensor nodes; \(S_i\) and \(S_j\) are said to be adjacent or 1-hop neighbors, iff the Euclidean distance between those two sensors,

$$\begin{aligned} d_{(S_i,S_j)}^t=\sqrt{(S_{(i,x)}^t-S_{(j,x)}^t)^2+(S_{(i, y)}^t-S_{(j,y)}^t)^2} \end{aligned}$$
(2)

does not exceed the communication range, \(r_c\) i.e.

$$\begin{aligned} l_{(S_i,S_j)}^t \in L^t \Longleftrightarrow d_{(S_i,S_j)}^t \le r_c. \end{aligned}$$
(3)

Since the links in the communication graph are undirected, we have

$$\begin{aligned} l_{(S_i,S_j)}^t \in L^t \Longleftrightarrow l_{(S_j,S_i)}^t \in L^t. \end{aligned}$$
(4)

The set of nodes adjacent to \(S_i\) at time \(t\), called the neighborhood set of \(S_i\) is denoted as \(N_{S_i}^t\), and can be defined as follows,

$$\begin{aligned} N_{S_i}^t = \{S_j | S_j \in S ~and~ l_{(S_i,S_j)}^t \in L^t\}. \end{aligned}$$
(5)

A test graph, \(T=(S',{L'}^t)\) can be constructed from the communication graph by excluding the nodes with permanently faulty communication units and the links associated with those nodes. So \(S' \subseteq S \), \({L'}^t \subseteq L^t\), and T is a sub-graph of C. Each link, \(l_{(S_i,S_j)}^t \in {L'}^t \) is labelled by a binary value \(c_{(S_i,S_j)}^t\). The number of faulty neighbors for any node \(S_i \in S'\) is upper bounded by (\(\lceil |N_{S_i}^t| /2 \rceil -1\)). This is an implicit requirement for a conventional majority voting protocol [19, 26, 27] which is followed in the proposed algorithm.

3 Proposed Fault Diagnosis Algorithm

In this section, we describe the proposed fault diagnosis algorithm for WSNs. The algorithm executes in three phases. In the first phase (comparison phase), each fault-free node compares its sensor reading and residual energy level with that of the neighboring sensor nodes to classify them as fault-free, intermittently faulty, or permanently faulty.

In order to have global view of the network at each fault-free node, the local diagnostic views are to be disseminated. Flooding based approach is a trivial information exchange approach in WSNs. However, the biggest deficiency of this approach that makes it unsuitable for WSNs is message implosion problem [28, 29]. To overcome this, several efficient flooding mechanisms and improvements have been proposed [3043]. Threshold based flooding improvements (counter-based, distance-based, and location-based) are discussed in [30, 31], where nodes decide whether to rebroadcast the flooding packet or not, depending on certain thresholds. In self-pruning, dominant pruning [32] and neighbor-coverage scheme [31], a node relays received packet only if it has neighbors not covered by previous forwarding nodes. Multipoint Relay (MPR) scheme [3337] improves flooding efficiency by choosing a minimal set of neighbors, that covers all 2-hop neighbors of a node, for flooding. In [38], authors have discussed a Topology Broadcast based on Reverse-Path Forwarding (TBRPF) scheme where each node generates spanning tree (also called source tree) providing paths to all reachable nodes. Only the nonleaf nodes in the source tree, where the root is the source node, rebroadcasts the flood packet. However, maintenance of multiple source trees is complex. In cluster based flooding schemes [30], only the cluster heads and gateway nodes act as forwarding nodes. Ogier et al. [39] have discussed a Connected Dominating Set (CDS) based flooding in which MANET Designated Routers (MDRs) are responsible for flooding link state advertisements (LSAs). Adjacencies are formed only between MDRs or Backup MDRs and a subset of their neighbors to reduce flooding overhead. Some MDRs serve as Backup MDRs which provide flooding when neighboring MDRs fail. The set of MDRs forms a connected dominating set, and the set of MDRs and Backup MDRs forms a biconnected dominating set. Baccelli et al. [40] have discussed an overlay subgraph based on a Relative Neighbor Graph (RNG) scheme called Synchronized Link Overlay-Triangular (SLOT), with the objective to have low overlay link density, and low overlay link change rates. Authors have claimed SLOT to be efficient with respect to control traffic; thus useful in dense networks. Cordero et al. [41] have discussed the impact of jitter on flooding that reduces the number of packet collisions and the number of transmissions, but with increased delay. Although these approaches are improvements to blind flooding technique, but would incur more communication cost to achieve the objective of the proposed algorithm, i.e., to disseminate the local diagnostic views for generating global view at each fault-free node in the network. Another efficient dissemination scheme called RPL (Routing Protocol for Low-Power and Lossy Networks) is the key discussion in [42]. RPL organizes the topology as a set of one or more Destination Oriented Directed Acyclic Graphs (DODAGs) that are used for message dissemination.

The generation, and dissemination of global diagnosis view among fault-free nodes in the network requires a connected topology among them. This is possible with minimum \((n-1)\) number of edges, where \(n\) is the number of fault-free nodes. Therefore, we adapt a spanning tree (ST) based dissemination mechanism [17], similar to DODAGs in RPL, to have less communication complexity for dissemination. A distributed ST, rooted at a robust node (initiator), that overlays all fault-free nodes in the network is constructed during building phase. Here, the term distributed means no sensor has the complete ST information, rather each node in the tree keeps the information about its parent and the list of children.

In the last phase (dissemination phase), the ST is used to exchange the local and global diagnostics among the nodes. We discuss these phases in sequel.

3.1 Comparison Phase

Considering spatial correlation in sensor networks, the measurement difference of any two fault-free neighboring sensors is presumed to be very small. However, if at least one of them is faulty then the difference is significant. Hence, if \(x_{S_{i}}^{t}\) is the sensor measurement of node \(S_i\) at a given time \(t\) and \(l_{(S_i,S_j)}^t \in {L'}^t\) then

$$\begin{aligned} |x_{S_{i}}^{t} - x_{S_{j}}^{t}| \left\{ \begin{array}{ll} \le \delta _1, &{}{ if }~{ both}~S_i~{ and}~S_j~{ are}~{ fault}\hbox {-}{ free}\\ > \delta _1, &{} { otherwise}. \end{array}\right. \end{aligned}$$
(6)

To aid the diagnosis process the residual energy levels of neighboring sensor nodes are also compared. In a homogeneous sensor network, the nodes in close proximity have similar level of residual energy since they do same duty [44, 45]. Hence, if \(E_{S_i}^t\) represents the residual energy of node \(S_i\) and \(S_j \in N_{S_i}^t\) then

$$\begin{aligned} |E_{S_i}^t - E_{S_j}^t| \left\{ \begin{array}{ll} \le \delta _2, &{} { if} ~{ both}~S_i~{ and}~S_j~{ are}~{ fault}\hbox {-}{ free}\\ > \delta _2, &{} { otherwise}. \end{array}\right. \end{aligned}$$
(7)

In Eqs. (6) and (7), \(\delta _1\) and \(\delta _2\) are two predefined thresholds. These thresholds may vary depending on the application.

The residual energy comparison not only strengthens the diagnosis process, but also helps in detecting the malicious nodes that are compromised by Denial of Service (DoS) attacks, as faulty. This is due to the fact that such malicious nodes show abnormal energy consumption rate [46, 47].

Now, the comparison outcome, \(c_{(S_i,S_j)}^t\), of any two neighboring nodes \(S_i\) and \(S_j\) can be defined as follows

$$\begin{aligned} c_{(S_i,S_j)}^t= \left\{ \begin{array}{ll} 0, &{} if ~ |~x_{S_i}^t - x_{S_j}^t| \le \delta _1 ~{ and}~ |~E_{S_i}^t - E_{S_j}^t| \le \delta _2\\ 1, &{} { otherwise}. \end{array}\right. \end{aligned}$$
(8)

In Eq. (8), \(c_{(S_i,S_j)}^t=0\) signifies both \(S_i\) and \(S_j\) are fault-free. But, if at least one of \(S_i\) and \(S_j\) is faulty then \(c_{(S_i,S_j)}^t=1\).

Each sensor node, \(S_i \in S\) maintains a boolean status array \(StatR_{S_i}[]\) of size \(n\), to store the fault status of all the nodes in the network. Initially all 1-hop neighbors are assumed to be fault-free \((0)\) and the status of all non-neighboring nodes are unknown \((-1)\) as given in Eq. 9.

$$\begin{aligned} StatR_{S_i}[j]= \left\{ \begin{array}{ll} -1, &{} { if} ~S_j \notin N_{S_i}^t\\ 0, &{} { otherwise}. \end{array}\right. \end{aligned}$$
(9)

In each round, upto total of \(r\) rounds, each sensor node \(S_j \in S'\) sends message of type \({\textsc {Data}}\), containing its observed sensor reading and residual energy level, to \(S_i \in N_{S_j}^t\) i.e. \(m_{S_j}=({\textsc {Data}},x_{S_j}^t, E_{S_j}^t)\). Upon receiving such \(m_{S_j}\) from its neighbor \(S_j\), node \(S_i\) performs the threshold test defined in Eq. (8) and increments \(StatR_{S_i}[j]\) by 1, if at least one of the test condition fails. At the end of \(r\) rounds, each sensor finds a local diagnosis view about the 1-hop neighbors. However, these local views may not be correct. Consider the following cases:

Case-I:

Node \(S_i\) may misdiagnose an intermittently faulty neighbor \(S_j\) as fault-free. This case arises if \(S_j\) sends similar sensor measurement and similar residual energy value to \(S_i\) in all rounds.

figure c

To overcome this situation, \(S_i\) follows majority voting among the decisions of all \(S_k \in N_{S_j}^t\) about the fault status of \(S_j\). However, the views of all nodes \(S_k \in (N_{S_j}^t \cap N_{S_j}^t)\) that are already detected as faulty by \(S_i\) can be discarded to reduce the computation overhead. We consider the maximum number of neighbors to which \(S_j\) may send such similar values in all rounds is \(\lceil n^+_{S_j} / 2 \rceil -1\), where \(n^+_{S_j}\) represents the number of fault-free neighbors of \(S_j\). Of course, this necessitates that each node sends the local view to its 2-hop neighbors. We follow conventional flooding for this, at a cost of \(O(n \times d)\), where \(d\) is the average node degree (refer Theorem 3).

So \(S_i\) finds the set of neighbors of \(S_j\) that are not yet detected faulty by \(S_i\). Let this set be denoted as \(NYF_{(S_i,S_j)}^t\) and defined as in Eq. (10).

$$\begin{aligned} NYF_{(S_i,S_j)}^t = \{ S_k | S_k \in N_{S_j}^t ~and~ StatR_{S_i}[k] \le 0\}. \end{aligned}$$
(10)

Let \(NYF0_{(S_i,S_j)}^t \subseteq NYF_{(S_i,S_j)}^t \) represents the set of neighbors of \(S_j\) in \(NYF_{(S_i,S_j)}^t\) that have detected \(S_j\) to be fault-free i.e.

$$\begin{aligned} NYF0_{(S_i,S_j)}^t = \{ S_k | S_k \in NYF_{(S_i,S_j)}^t ~and~ StatR_{S_k}[j] = 0\}. \end{aligned}$$
(11)

Now, \(S_i\) can follow majority voting and update its status array as follows:

$$\begin{aligned} StatR_{S_i}[j]= \left\{ \begin{array}{ll} 0, &{} { if}~ |NYF0_{(S_i,S_j)}^t| \ge \left\lceil \dfrac{1}{2}|NYF_{(S_i,S_j)}^t| \right\rceil ~and~ StatR_{S_i}[j] \ne 0\\ 1, &{} { if}~ |NYF0_{(S_i,S_j)}^t| < \left\lceil \dfrac{1}{2}|NYF_{(S_i,S_j)}^t| \right\rceil ~and~ StatR_{S_i}[j]=0. \end{array}\right. \end{aligned}$$
(12)

In Eq. (12), the value 0 or 1 of \(StatR_{S_i}[j]\) indicates \(S_j\) to be fault-free or intermittently faulty, respectively. Nevertheless, we can consider any positive value less than \(r\) to signify intermittent fault. Additionally, majority voting can cope with packet losses to certain extent. For instance, loss of packets in some rounds from a fault-free neighbor which affects the decision of \(S_i\) can be overridden.

Case-II:

Node \(S_i\) may misdiagnose an intermittently faulty neighbor \(S_j\) as permanently faulty. This case arises if \(S_j\) sends sensor measurement and residual energy value, either or both of which deviates significantly from that of \(S_i\) in all rounds.

To handle this situation and to determine the actual fault type, \(S_i\) follows Eq. (13) if \(StatR_{S_i}[j]>0\). The equation is based on the rationale that a permanently faulty node is always detected as faulty by all neighbors in all rounds.

$$\begin{aligned} StatR_{S_i}[j]= \left\{ \begin{array}{ll} 1,&{\textit{if}} ~\left( \sum _{\begin{array}{c} (S_k \in N_{S_j}^t) \end{array}} StatR_{S_k}[j]\right) \ne r \times |N_{S_j}^t|\\ 2, &{} {\textit{otherwise}}. \end{array}\right. \end{aligned}$$
(13)

The values 1 or 2 of \(StatR_{S_i}[j]\) in Eq. (13) signifies \(S_j\) to be intermittently faulty or permanently faulty, respectively. The steps followed in this phase are more precisely described in Algorithm 1.

3.2 Building Phase

During this phase, messages of type \({\textsc {SpanTree}}\) are exchanged to construct a ST covering all fault-free nodes. This is initiated by a robust node called initiator. In general, a node \(S_j\) transmits \(m_{S_j}=({\textsc {SpanTree}},Parent_{S_j})\). If \(S_j\) is the initiator then \(Parent_{S_j}= \phi \). Node \(S_i\), upon receiving the first \({\textsc {SpanTree}}\) message from \(S_j \in N_{S_i}^t\), verifies from its local view if \(S_j\) is fault-free. After confirmation, \(S_i\) adds \(S_j\) in its children set if \(S_j\) has already chosen \(S_i\) as its parent i.e. \(S_i = Parent_{S_j}\); otherwise, \(S_i\) chooses fault-free \(S_j\) as its parent in the ST, and intimates the same to its neighbors by sending its own \({\textsc {SpanTree}}\) message. These steps are precisely given in Algorithm 2.

figure d

3.3 Dissemination Phase

The objective of the dissemination phase is to generate global diagnostic view at each fault-free node by exchanging their local diagnostics. Once the ST is constructed, all leaf nodes start disseminating their local diagnostics to their parent.

figure e

A mobile node \(S_i\) handles two types of messages during dissemination phase.

  1. (i)

    LocalDiagnostic—After receiving such message from its child \(S_j\) i.e. \(msg=({\textsc {LocalDiagnostic}},StatR_{S_j}[])\), node \(S_i\) updates its local view to include the local view of \(S_j\). Each mobile \(S_i\) maintains a set \(ChildrenSentLD_{S_i}\), to keep track the list of children those have already sent their local diagnostics. \(S_i\) disseminates its local diagnostic only when all its children have sent their local views to it. However, if \(S_i\) is the initiator then it has deduced the global view, and starts disseminating it down the ST.

  2. (ii)

    GlobalDiagnostic—If \(S_i\) receives a GlobalDiagnostic message from its parent \(S_j\) then it updates its status array to incorporate the global view, and subsequently broadcasts it to its children if it is not a leaf node.

Algorithm 3 shows the steps followed in this phase.

4 Analytical Study of Proposed FDA

In this section we follow mathematical analysis to prove the correctness and completeness of the proposed algorithm.

4.1 Proposed FDA Correctness

If every fault-free sensor \(S_i \in S'\) deduces the correct state of all other sensors in the network, then the proposed FDA is said to be correct. This can be defined with respect to correct local diagnosis and correct dissemination of local and global diagnosis information. We define the following lemmas to support the correctness of the proposed FDA.

Lemma 1

Let \(T=(S',{L'}^t)\) be the test graph for a sensor network, where \(S'\) and \({L'}^t\), respectively represents the set of sensor nodes without permanently faulty communication unit and the set of logical links between them at time \(t\). If \(S_j \in S'\) is an intermittently faulty node, then at the end of comparison phase, it is neither misdiagnosed as fault-free nor as permanently faulty by any other sensor node \(S_i \in (N_{S_j}^t \cap FF^t)\); where \(FF^t \subseteq S'\) is the set of fault-free nodes at time \(t\).

Proof

In case, an intermittently faulty sensor node \(S_j \in S'\) sends to \(S_i \in (N_{S_j}^t \cap FF^t)\), similar sensor measurement and similar residual energy value in all rounds; \(S_i\) misdiagnoses \(S_j\) as fault-free. However, this misdiagnosis is resolved once \(S_i\) follows the majority voting defined in Eq. (12), after receiving the local views of all neighbors of \(S_j\). The validation of Eq. (12) can be given as follows.

If \(S_i\) finds any node \(S_k \in N_{S_j}^t\) to be faulty then it discards the decision of \(S_k\) from voting. Hence, the maximum number of neighbors of \(S_j\) that may take part in voting is \(|N_{S_j}^t|\). This case arises when none of the neighbors of \(S_j\) have been detected faulty by \(S_i\). The minimum number of neighbors of \(S_j\) that may take part in voting is \(\lfloor |N_{S_j}^t|/2\rfloor +1\). This case arises when \(S_j\) has maximum number of faulty neighbors i.e. \(\lceil |N_{S_j}^t|/2 \rceil -1\), and all of them are in the neighborhood of \(S_i\) and are correctly detected faulty by \(S_i\). Hence,

$$\begin{aligned} \lfloor |N_{S_j}^t|/2\rfloor +1 \le |NYF_{(S_i,S_j)}^t| \le |N_{S_j}^t|. \end{aligned}$$
(14)

The maximum number of fault-free neighbors that may detect \(S_j\) as fault-free is \(\lceil (\) minimum number of fault-free neighbors of \(S_j )/2 \rceil -1\) i.e.

$$\begin{aligned} |NYF0_{(S_i,S_j)}^t| \le \left\lceil \dfrac{|\lfloor N_{S_j}^t|/2 \rfloor +1}{2}\right\rceil -1. \end{aligned}$$
(15)

Now, consider a worst case scenario where \(|NYF_{(S_i,S_j)}^t|\) is the minimum i.e. \(\lfloor |N_{S_j}^t|/2\rfloor +1\) and \(|NYF0_{(S_i,S_j)}^t|\) is the maximum i.e. \(\left\lceil \dfrac{|\lfloor N_{S_j}^t|/2 \rfloor +1}{2}\right\rceil -1\). The condition \(|NYF0_{(S_i,S_j)}^t| < \left\lceil \dfrac{1}{2}|NYF_{(S_i,S_j)}^t| \right\rceil \) holds. Hence, if \(S_i\) has misdiagnosed \(S_j\) as fault-free i.e. \(StatR_{S_i}[j] = 0\), then it would update \(StatR_{S_i}[j]\) to 1 (line 31, Algorithm 1); thus classifying \(S_j\) correctly as intermittently faulty.

It is also necessary to prove that \(S_i\), after correctly diagnosing the intermittently faulty neighbor \(S_j\), does not override its decision by subsequently following Eq. (13). This can be verified with the following reasoning.

The inequality

$$\begin{aligned} \left( \sum _{\begin{array}{c} (S_k \in N_{S_j}^t) \end{array}} StatR_{S_k}[j]\right) \ne r \times |N_{S_j}^t| \end{aligned}$$

holds, since there are non-zero number of fault-free neighbors of \(S_j\) that have correctly diagnosed \(S_j\) i.e. for some fault-free node \(S_k \in N_{S_j}^t\), \(0 < StatR_{S_k}[j] < r\). Therefore, \(StatR_{S_i}[j]\) will remain unchanged. Hence, each node \(S_i \in (N_{S_j}^t \cap FF^t)\) correctly diagnoses the state of an intermittently faulty neighbor \(S_j\) at the end of the comparison phase. \(\square \)

Lemma 2

Let \(S'\), the set of sensor nodes without permanently faulty communication unit and \({L'}^t\), the set of logical links between them at time \(t\), collectively define an undirected graph \(T=(S',{L'}^t)\), that represents the test graph of a sensor network. Let \(FF^t \subseteq S'\) be the set of fault-free nodes in the network at time \(t\). At the end of the comparison phase, the status of a fault-free node \(S_j\) is correctly detected by each node \(S_i \in (N_{S_j}^t \cap FF^t)\).

Proof

It is clear from Eqs. (6) and (7) that a fault-free node \(S_j \in S'\) can be misclassified as faulty by a node \(S_i \in \{N_{S_j}^t\cap FF^t \}\), if \(S_j\) sends to \(S_i\): its sensor measurement and residual energy value, either or both of which deviates significantly from that of \(S_i\) in at least one round. Nevertheless, by definition, a fault-free node always sends similar sensor measurement and never gives significantly different remaining energy value compared to its neighbors. So \(S_j\) can never fail any of the threshold tests performed by \(S_i\) in Eqs. (6) and (7), in any round. Hence, it is properly classified as fault-free by all \(S_i \in (N_{S_j}^t\cap FF^t)\). Moreover, this decision is not affected by the majority voting. The reasoning is as follows.

For each node \(S_i \in (N_{S_j}^t\cap FF^t)\), the inequality in Eq. (14) holds and the explanation is the same as given in Lemma 1. The minimum number of neighbors, out of those participated in voting, that have correctly identified \(S_j\) as fault-free can be given as in Eq. (16).

$$\begin{aligned} |NYF0_{(S_i,S_j)}^t| \ge \left( \left\lfloor |N_{S_j}^t|/2\right\rfloor +1\right) . \end{aligned}$$
(16)

From Eqs. (14) and (16), it can easily be verified that,

$$\begin{aligned} |NYF0_{(S_i,S_j)}^t| \nless \left\lceil \dfrac{1}{2}|NYF_{(S_i,S_j)}^t| \right\rceil . \end{aligned}$$
(17)

So the majority voting in Eq. (12) would not affect the decision of \(S_i\) about its fault-free neighbor \(S_j\). Furthermore, Eq. (13) is followed by \(S_i\) if \(StatR_{S_i}[j] > 0\). So it cannot alter the value of \(StatR_{S_i}[j]\) in this case. Therefore, a fault-free node \(S_j\) is always diagnosed as fault-free by each node \(S_i \in (N_{S_j}^t \cap FF^t)\) at the end of the comparison phase. \(\square \)

Lemma 3

Let the test graph of a sensor network be represented as \(T=(S',{L'}^t)\); where \(S'\) is the set of sensor nodes having fault-free communication unit and \({L'}^t\) is the set of logical links among them at time \(t\). Let \(FF^t \subseteq S'\) represents the set of fault-free nodes at time \(t\). If a node \(S_j\) is having permanently faulty sensor, then it is neither diagnosed as fault-free nor as intermittently faulty by any node \(S_i \in (N_{S_j}^t \cap FF^t)\).

Proof

By convention, a node \(S_j\) with permanently malfunctioning sensor gives wrong sensor readings to all its neighbors, in all rounds. A node \(S_i \in \{N_{S_j}^t\cap FF^t \}\) increments \(StatR_{S_i}[j]\) by one, upon receiving wrong sensor readings from \(S_j\), in each round. So at the end of \(r\) rounds, \(StatR_{S_i}[j]\) carries the value \(r\). It can easily be seen, as follows, that the majority voting does not alter this decision. The inequality in Eq. (14) holds due the same reasoning as described in Lemma 1, and \(|NYF0_{(S_i,S_j)}^t|=0\). Hence, it is clear that

$$\begin{aligned} |NYF0_{(S_i,S_j)}^t| < \left\lceil \dfrac{1}{2}|NYF_{(S_i,S_j)}^t|\right\rceil , \end{aligned}$$
(18)

since all fault-free neighbors of \(S_j\) have properly identified it and two faulty nodes never give the same incorrect readings. However, Eq. (12) does not alter \(StatR_{S_i}[j]\) since \(StatR_{S_i}[j] \ne 0\), i.e., because \(S_j\) is permanently faulty.

It is also important to validate Eq. (13) for this scenario. For permanently faulty node \(S_j\),

$$\begin{aligned} \left( \sum _{\begin{array}{c} (S_k \in N_{S_j}^t) \end{array}} StatR_{S_k}[j]\right) = r \times |N_{S_j}^t| \end{aligned}$$

holds, due to the fact that each \(S_k \in N_{S_j}^t\), irrespective of whether it is faulty or not, has \(StatR_{S_k}[j] = r\) at the end of diagnosis. So, \(StatR_{S_i}[j]\) is set to a value 2, signifying it as permanently faulty. Hence, it can be concluded that \(S_j\); a node with permanently faulty sensor is always diagnosed correctly by any node \(S_i \in (N_{S_j}^t \cap FF^t)\). \(\square \)

Lemma 4

(Correct Local Diagnosis View). In an undirected graph, \(T=(S',{L'}^t)\) that represents the test graph of a sensor network, where \(S'\) is the set of sensor nodes without permanently faulty communication unit and \({L'}^t\) is the set of logical links among them at time \(t\), let \(FF^t \subseteq S'\) represents the set of fault-free in the network at time \(t\). If a node \(S_j\) is fault-free, it is never misdiagnosed as faulty, and if it is faulty with certain fault type (permanent or intermittent), then it is neither misclassified as fault-free nor as a wrong fault type by any other node \(S_i \in (N_{S_j}^t \cap FF^t)\).

Proof

The proof of Theorem 1 directly follows from Lemma 1, 2, and 3. Referring Lemma 2, if a node \(S_j\) is fault-free, then it is always diagnosed as fault-free by any fault-free node in \(N_{S_j}^t\). Lemma 1 asserts that, if \(S_j\) is an intermittently faulty node then it is never classified as fault-free, nor as permanently faulty by any other node \(S_i \in (N_{S_j}^t \cap FF^t)\). Again, the permanently faulty nodes are always detected as permanently faulty by all fault-free neighbors, that is vouched by Lemma 3. \(\square \)

Lemma 5

Let the test graph of a sensor network be represented as \(T=(S',{L'}^t)\); where \(S'\) is the set of sensor nodes having fault-free communication unit and \({L'}^t\) is the set of logical links among them at time \(t\). At the end of the building phase, the ST contains all and only fault-free nodes in the network.

Proof

It is clear from Lemma 4 that each fault-free node obtains correct local view. In the building phase, \({\textsc {SpanTree}}\) messages from the faulty nodes can be discarded by the fault-free sensors after confirming the same from their local views. This eliminates any faulty node from becoming a member of the ST.

Furthermore, the subgraph generated from T excluding all faulty nodes is connected. So, each fault-free node receives \({\textsc {SpanTree}}\) messages and broadcasts its own, after choosing its parent. Thus, all fault-free nodes become part of the ST. Hence, the ST contains all and only fault-free nodes. \(\square \)

Lemma 6

Let the undirected graph, \(T=(S',{L'}^t)\) represents the test graph of a sensor network, where \(S'\) is the set of sensor nodes without permanently faulty communication unit and \({L'}^t\) is the set of logical links among them at time \(t\). At the end of the dissemination phase, each node in the ST has the global view of the network.

Proof

Given that nodes are able to verify if they are leaves of the ST, all leaves start dissemination of their local views to their parents. A non-leaf node collects the local views of all its children and combines them with its own to generate an aggregated view, which is then transmitted to it parent up in the ST. Once the root node collects the aggregated views of all it children, it then generates the global diagnostic view. The root disseminates the global view to its children, which is then relayed to the nodes in the subsequent levels in the tree. In this way, correct dissemination of the global view is achieved. \(\square \)

Theorem 1

(Proposed FDA Correctness) In an undirected graph, \(T=(S',{L'}^t)\) represents the test graph of a sensor network, where \(S'\) is the set of sensor nodes without permanently faulty communication unit and \({L'}^t\) is the set of logical links among them at time \(t\), let \(FF^t \subseteq S'\) represents the set of fault-free in the network at time \(t\). At the end of the diagnosis, each node \(S_i \in FF^t\) has the consistent global diagnostic view of the network.

Proof

The proof of this theorem follows Lemma 4, 5, and 6. Lemma 4 asserts that the local diagnosis view of any node \(S_i \in FF^t\) is correct. According to Lemma 6, these local views are properly disseminated to the root of the ST. Subsequently the global view generated at the root is disseminated to all nodes in the ST. Moreover, the ST overlays all and only fault-free nodes, which is proved in Lemma 5. Therefore, each node \(S_i \in FF^t\) has the consistent global diagnostic view of the network, at the end of the diagnosis session. \(\square \)

4.2 Proposed FDA Completeness

The proposed diagnosis algorithm is said to be complete if no node in the network is left undiagnosed, and the proof can be given as in Theorem 2.

Theorem 2

(Proof of Completeness) Let the test graph of a sensor network is represented as an undirected graph \(T=(S',{L'}^t)\), where \(S'\) is the set of sensor nodes with good communication unit and \({L'}^t\) is the set of logical links among them at time \(t\). If a node \(S_i \in S'\), then at the end of the fault diagnosis session, \(S_i \in (FF^t \cup F_P^t \cup F_I^t)\) i.e. \(S'=(FF^t \cup F_P^t \cup F_I^t)\), where \(FF^t \subseteq S'\), \(F_P^t \subset S'\), and \(F_I^t \subset S'\) respectively represent the set of fault-free, permanently faulty, and intermittently faulty nodes in the network at time \(t\).

Proof

From Lemma 1 and 3 it is clear that, if a node \(S_i\) is faulty, then at the end of the fault diagnosis session, \(S_i \in F_P^t\) (if \(S_i\) is permanently faulty) or \(S_i \in F_I^t\) (if \(S_i\) is intermittently faulty). Lemma 2 conveys that, if \(S_i\) is fault-free then \(S_i \in FF^t\), at the end of the fault diagnosis session. Hence, it can be concluded that, if \(S_i \in S\) then \(S_i \in (FF^t \cup F_P^t \cup F_I^t)\); thus \(S'=(FF^t \cup F_P^t \cup F_I^t)\) at the end of the fault diagnosis session. \(\square \)

Theorem 3

The communication complexity of the proposed FDA is \(O(n(r+d))\), where \(n\) is the number of nodes in the network, \(r\) is the number of rounds, and \(d\) is the average node degree.

Proof

In the worst case scenario, all the nodes in the network are fault-free. The message complexity of each phase of the algorithm can be found as follows.

  1. (i)

    Comparison phase—Each node sends its sensor measurement and residual energy value for \(r\) rounds. So the total number of \({\textsc {Data}}\) messages is \(n \times r\). Each node broadcasts it local view to its 1-hop and 2-hop neighbors. Total number of such broadcasts is \(n \times (d+1)\), where \(d\) is the average node degree. This stems from the fact that each node broadcasts its local view, and subsequently all 1-hop neighbors broadcast the received local view to make it available at 2-hop neighbors of the sender.

  2. (ii)

    Building phase—Each node sends \({\textsc {SpanTree}}\) message once. Hence, total number of messages exchanged during the construction of the ST is \(n\).

  3. (iii)

    Dissemination phase—Each node, except the initiator, sends the \({\textsc {LocalDiagnostic}}\) message. So, total number of \({\textsc {LocalDiagnostic}}\) message exchanged is \(n-1\). Each node, except the leaves, sends transmits \({\textsc {GlobalDiagnostic}}\) message once. The maximum number of such message exchange is \(n-1\), and this case arises when the depth of the ST is \(n-1\).

Thus, the total number of messages exchanged during diagnosis is \(n(r+d+4)-2\), and the communication complexity is \(O(n(r+d))\). \(\square \)

5 Simulation Analysis

The performance evaluation of the proposed FDA is presented in this section. A set of simulations are performed using the Castalia-3.2 simulator on OMNET++ platform, to study the feasibility of the proposed algorithm. The results are compared with that of the detection algorithm discussed by Lee and Choi [21].

Based on the faulty behavior, the proposed FDA classifies the sensor nodes into three different classes: permanent fault class, intermittent fault class, and fault-free class. The following two performance measures are used as the evaluation metrics.

  • Classification Accuracy (CA): The ratio of the number of nodes classified into a particular class to the total number of nodes of that class. Hence, CA is defined with respect to each of the three classes mentioned above.

  • False Alarm Rate (FAR): The ratio of the sum of the number of faulty nodes classified as fault-free and the number of fault-free nodes classified as faulty to the total number of nodes in the network.

FAR is defined with an observation that the severity of a fault-free node being misdiagnosed as faulty or a faulty node as fault-free is more, as compared to a permanently faulty node being misdiagnosed as intermittently faulty or vice-versa.

The performance of the proposed algorithm depends on many factors viz.

  • \(d\)—the average node degree.

  • \(p_{pf}\)—the probability that a sensor node has a permanently faulty sensor.

  • \(p_{if}\)—the probability that a sensor node has an intermittently faulty sensor.

  • \(p_{fc}\)—the probability that a sensor node has a faulty communication unit.

  • \(\delta _1, \delta _2\)—thresholds used.

The following two simulation scenarios are considered for discussion.

5.1 Simulation Scenario 1

The first simulation scenario is created for a sensor network with 1,000 nodes randomly deployed over \(1{,}000 \times 1{,}000\,\mathrm{m}^2\) area. With proper adjustment of the transmission range (common for all nodes), the desired value of d can be obtained. In the simulation, the sensor nodes are randomly chosen to have permanently faulty sensors with probabilities 0.02, 0.04, 0.06, 0.08, 0.10 and 0.12, respectively. It is also considered that \(p_{if}\) is 150 % of \(p_{pf}\), in each case.

The values of the thresholds \(\delta _1\), and \(\delta _2\) are considered 6 and 4, respectively. In order to evaluate Lee and Choi’s algorithm, the same simulation scenario is considered with \(\theta _1 = \lceil d/2 \rceil \), and \((r-\theta _2) \simeq \hat{\theta }_2 = 2\) as the values of thresholds used in their algorithm. The FDA is run for \(r(=10)\) rounds to handle intermittent faults. The obtained simulation results for CA and FAR are compared as depicted in Figs. 1a, 2a, 3a, and 4a.

Fig. 1
figure 1

Comparison of permanent fault CA with a \(\delta _1=6, \delta _2=2\) and b \(\delta _1=4, \delta _2=2\)

Fig. 2
figure 2

Comparison of intermittent fault CA with a \(\delta _1=6, \delta _2=2\) and b \(\delta _1=4, \delta _2=2\)

Fig. 3
figure 3

Comparison of fault-free CA with a \(\delta _1=6, \delta _2=2\) and b \(\delta _1=4, \delta _2=2\)

Fig. 4
figure 4

Comparison of false alarm rates with a \(\delta _1=6, \delta _2=2\) and b \(\delta _1=4, \delta _2=2\)

5.2 Simulation Scenario 2

The second simulation scenario is created with same fault percentage as in scenario 1. However, the thresholds \(\delta _1\), and \(\delta _2\) are considered to have values 4 and 2, respectively. The values of the parameter in Lee and Choi’s algorithm are the same as in first simulation scenario. Figures 1b, 2b, 3b, and 4b shows the obtained results for CA, and FAR in this scenario.

5.3 Discussion

It can clearly be observed that the CA decreases with lower node degrees. This is because the fault-free sensor nodes may not always form a connected graph for fault diagnosis purpose, in case of sparse networks. In such scenarios, all neighbors of a particular node may be faulty at the same time, leading to misdiagnosis of the node. Such scenarios arise with more counts for low \(d\) and high fault probability, in which case the performance even degrades.

Figure 1 depicts the comparison of classification accuracy for permanently faulty nodes with \(d\) values 6.8, 10.2, and 14.3. In some rounds, if a permanently faulty node produces a sensor measurement that does not differ from the sensor measurements of its fault-free neighbors by a minimum threshold \(\delta _1\), then it is not classified as permanently faulty. Moreover, high value of \(\delta _1\) boosts the occurrence of such scenarios. The additional threshold test on residual energy in the proposed FDA handles such cases and improves the performance.

An intermittently faulty node that generates incorrect sensor measurements in less than or equal to \(\theta _2\) rounds are not classified as intermittently faulty in the fault detection algorithm by Lee and Choi.

Fault-free nodes are diagnosed with better accuracy for high value of \(\delta _1\). If the value of \(\delta _1\) is chosen to be small, and in some rounds a fault-free node generates sensor measurements that differs from the sensor measurements of its neighboring nodes by the minimum threshold \(\delta _1\), then it will not be diagnosed as fault-free by its neighbors. Such miss-classification scenarios are suppressed in the proposed FDA by the additional energy based threshold test.

The comparison of false alarm rates are clearly shown in Fig. 4. As obvious, it is found that with increase in fault probability, FAR increases. Moreover, the FAR is high for high value of \(\delta _1(=4)\) as compared to low value of \(\delta _1=(2)\); with respect to the same value of \(\delta _2(=2)\) in both cases.

The simulation results show that if thresholds are not chosen carefully, then the performance may be the worst. The average node degree, \(d\) must be adjusted to relatively high to have better performance.

6 Conclusions

In this paper, we propose a distributed fault diagnosis algorithm for WSNs, in order to handle sensor nodes having permanently fault sensor or intermittently faulty processing unit. The proposed FDA not only diagnoses the faulty sensor nodes, also classifies those based on their fault types. The algorithm is based two threshold tests: (i) on sensor measurements of neighboring nodes, and (ii) on expected and actual residual energy of the sensor nodes. Two special cases of intermittent faults are handled: One, where an intermittently faulty node sends similar sensor measurement and similar residual energy value to some of its neighbors in all rounds; Another, where at least one of these values are dissimilar in all rounds. Through mathematical analysis the proposed FDA has been proved to be correct and complete. The simulation experimental results show that the algorithm detects and classifies the faulty nodes with high accuracy and low false alarm rate, even in case of high fault probability, by properly choosing the threshold values. In future, endeavour will be made to handle faults in dynamic topology environment.